Re: joins and filter queries affecting scoring

2011-10-28 Thread Martijn v Groningen
Have your tried using the join in the fq instead of the q?
Like this (assuming user_id_i is a field in the post document type and
self_id_i a field in the user document type):
q=posts_text:hello&fq={!join from=self_id_i
to=user_id_i}is_active_boolean:true

In this example the fq produces a docset that contains all user
documents that are active. This docset is used as filter during the
execution of the main query (q param),
so it only returns posts that contain the text hello for active users.
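
For reference, the same request could be built with SolrJ roughly like this (a sketch only, assuming SolrJ 3.x on the classpath; the field names are the ones from the example above):

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("posts_text:hello");   // main query, determines scoring/order
    // join filter: keep only posts whose user is active, without affecting the score
    query.addFilterQuery("{!join from=self_id_i to=user_id_i}is_active_boolean:true");
    QueryResponse rsp = server.query(query);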

Martijn

On 28 October 2011 01:57, Jason Toy jason...@gmail.com wrote:
 Does anyone have any idea on this issue?

 On Tue, Oct 25, 2011 at 11:40 AM, Jason Toy jason...@gmail.com wrote:

 Hi Yonik,

 Without a Join I would normally query user docs with:
 q=data_text:test&fq=is_active_boolean:true

 With joining users with posts, I get no results:
 q={!join from=self_id_i
 to=user_id_i}data_text:test&fq=is_active_boolean:true&fq=posts_text:hello



 I am able to use this query, but it gives me the results in an order that I
 don't want (nor do I understand its order):
 q={!join from=self_id_i to=user_id_i}data_text:test AND
 is_active_boolean:true&fq=posts_text:hello

 I want the order to be the same as I would get from my original
 q=data_text:test&fq=is_active_boolean:true, but with the ability to join
 with the Posts docs.





 On Tue, Oct 25, 2011 at 11:30 AM, Yonik Seeley yo...@lucidimagination.com
  wrote:

 Can you give an example of the request (URL) you are sending to Solr?

 -Yonik
 http://www.lucidimagination.com



 On Mon, Oct 24, 2011 at 3:31 PM, Jason Toy jason...@gmail.com wrote:
  I have 2 types of docs, users and posts.
  I want to view all the docs that belong to certain users by joining
 posts
  and users together.  I have to filter the users with a filter query of
  is_active_boolean:true so that the score is not affected, but since I do a
  join, I have to move the filter query to the query parameter so that I can
  get the filter applied. The problem is that since the is_active_boolean is
  moved to the query, the score is affected, which returns back an order that I
  don't want.
   If I leave the is_active_boolean:true in the fq parameter, I get no
  results back.
 
  My question is: how can I apply a filter query to users so that the score is
  not affected?
 




 --
 - sent from my mobile





 --
 - sent from my mobile




-- 
Met vriendelijke groet,

Martijn van Groningen


Always return total number of documents

2011-10-28 Thread Robert Brown
Currently I'm making 2 calls to Solr to be able to state "matched 20
out of 200 documents".


Is there no way to return the total number of docs as part of a 
search?



--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Re: Always return total number of documents

2011-10-28 Thread Michael Kuhlmann
On 28.10.2011 11:16, Robert Brown wrote:
 Is there no way to return the total number of docs as part of a search?

No, it isn't. Usually this information is of absolutely no value to the
end user.

A workaround would be to add some field to the schema that has the same
value for every document, and use this for faceting.
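
One way to flesh that out (an untested sketch; "all_docs" is a made-up field name that would hold the same value, e.g. 1, for every document) is to send q=*:* with the real query in a tagged filter, and exclude that tag on the facet, so the facet counts the whole index while numFound still reflects the match count:

    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("{!tag=main}your_field:your_query");  // the real user query, tagged (placeholder)
    q.setFacet(true);
    q.addFacetField("{!ex=main}all_docs");                 // facet ignores the tagged filter
    QueryResponse rsp = server.query(q);
    long matched = rsp.getResults().getNumFound();                               // e.g. 20
    long total = rsp.getFacetField("all_docs").getValues().get(0).getCount();    // e.g. 200

Otherwise, a second request with q=*:* and rows=0 is also quite cheap, since no documents need to be returned.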

Greetings,
Kuli


Re: Always return total number of documents

2011-10-28 Thread Robert Brown
Cheers Kuli,

This is actually of huge importance to our customers, to see how many
documents we store.

The faceting option sounds a bit messy, maybe we'll have to stick with
2 queries.


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Fri, 28 Oct 2011 11:43:11 +0200, Michael Kuhlmann k...@solarier.de
wrote:
 On 28.10.2011 11:16, Robert Brown wrote:
 Is there no way to return the total number of docs as part of a search?
 
 No, it isn't. Usually this information is of absolutely no value to the
 end user.
 
 A workaround would be to add some field to the schema that has the same
 value for every document, and use this for faceting.
 
 Greetings,
 Kuli



Re: Too many values for UnInvertedField faceting on field autocompleteField

2011-10-28 Thread Torsten Krah
On Wednesday, 26.10.2011, 08:02 -0400, Yonik Seeley wrote:
 You can also try adding facet.method=enum directly to your request

Added 

  query.set("facet.method", "enum");

to my Solr query at code level and now it works. Don't know why the
handler stuff gets ignored or overridden, but it's OK for my use case to
specify it at query level.

thx

Torsten




Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Ian Grainger
Hi, I'm using Grouping with group.truncate=true, The following simple facet
query:

facet.query=Monitor_id:[38 TO 40]

Doesn't give the same number as the nGroups result (with
grouping.ngroups=true) for the equivalent filter query:

fq=Monitor_id:[38 TO 40]

I thought they should be the same - from the Wiki page: 'group.truncate: If
true, facet counts are based on the most relevant document of each group
matching the query.'

What am I doing wrong?

If I turn off group.truncate then the counts are the same, as I'd expect -
but unfortunately I'm only interested in the grouped results.

- I have also asked this question on StackOverflow, here:
http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries

Thanks!

-- 
Ian

i...@isfluent.com a...@endissolutions.com
+44 (0)1223 257903


Re: changing omitNorms on an already built index

2011-10-28 Thread Simon Willnauer
On Fri, Oct 28, 2011 at 12:20 AM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Oct 27, 2011 at 6:00 PM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 we are not actively removing norms. if you set omitNorms=true and
 index documents they won't have norms for this field. Yet, other
 segments still have norms until they get merged with a segment that has
 no norms for that field, i.e. omits norms. omitNorms is anti-viral so
 once you set it to true it will be true for other segments eventually.
 If you optimize your index you should see that norms go away.


 This is only true in trunk (4.x!)
 https://issues.apache.org/jira/browse/LUCENE-2846

ah right, I thought this was ported - nevermind! thanks robert

simon

 --
 lucidimagination.com



Re: Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Martijn v Groningen
Hi Ian,

I think this is a bug. After looking into the code the facet.query
feature doesn't take into account the group.truncate option.
This needs to be fixed. You can open a new issue in Jira if you want to.

Martijn

On 28 October 2011 12:09, Ian Grainger i...@isfluent.com wrote:
 Hi, I'm using Grouping with group.truncate=true, The following simple facet
 query:

 facet.query=Monitor_id:[38 TO 40]

 Doesn't give the same number as the nGroups result (with
 grouping.ngroups=true) for the equivalent filter query:

 fq=Monitor_id:[38 TO 40]

 I thought they should be the same - from the Wiki page: 'group.truncate: If
 true, facet counts are based on the most relevant document of each group
 matching the query.'

 What am I doing wrong?

 If I turn off group.truncate then the counts are the same, as I'd expect -
 but unfortunately I'm only interested in the grouped results.

 - I have also asked this question on StackOverflow, here:
 http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries

 Thanks!

 --
 Ian

 i...@isfluent.com a...@endissolutions.com
 +44 (0)1223 257903




-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Ian Grainger
Thanks, Martijn. I have logged the bug here:
https://issues.apache.org/jira/browse/SOLR-2863

Is there any chance of a workaround for this issue before the bug is fixed?

If you want to answer the question on StackOverflow:
http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
I'll
accept your answer.


On Fri, Oct 28, 2011 at 12:14 PM, Martijn v Groningen 
martijn.v.gronin...@gmail.com wrote:

 Hi Ian,

 I think this is a bug. After looking into the code the facet.query
 feature doesn't take into account the group.truncate option.
 This needs to be fixed. You can open a new issue in Jira if you want to.

 Martijn

 On 28 October 2011 12:09, Ian Grainger i...@isfluent.com wrote:
  Hi, I'm using Grouping with group.truncate=true, The following simple
 facet
  query:
 
  facet.query=Monitor_id:[38 TO 40]
 
  Doesn't give the same number as the nGroups result (with
  grouping.ngroups=true) for the equivalent filter query:
 
  fq=Monitor_id:[38 TO 40]
 
  I thought they should be the same - from the Wiki page: 'group.truncate:
 If
  true, facet counts are based on the most relevant document of each group
  matching the query.'
 
  What am I doing wrong?
 
  If I turn off group.truncate then the counts are the same, as I'd expect
 -
  but unfortunately I'm only interested in the grouped results.
 
  - I have also asked this question on StackOverflow, here:
 
 http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
 
  Thanks!
 
  --
  Ian
 
  i...@isfluent.com a...@endissolutions.com
  +44 (0)1223 257903
 



 --
 Met vriendelijke groet,

 Martijn van Groningen




-- 
Ian

i...@isfluent.com a...@endissolutions.com
+44 (0)1223 257903


Solr Profiling

2011-10-28 Thread Rohit
Hi,

 

My Solr becomes very slow or hangs at times. We have done almost
everything possible, like:

- giving 16GB of memory to the JVM

- sharding

But these help only for a limited time. I want to profile the server and see
what's going wrong. How can I profile Solr remotely?
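
(For what it's worth, one common way to profile a remote Solr instance is to start the JVM with JMX enabled and attach VisualVM or JConsole to it. A minimal, unauthenticated setup is sketched below; the port number is arbitrary, and disabling auth/SSL is only sensible on a trusted network:

    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=18983
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

Commercial profilers such as YourKit can also attach remotely via their own agent.)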

 

Regards,

Rohit

 



Re: solr break up word

2011-10-28 Thread Boris Quiroz
Hi Erick,

I'll try without the type=index on analyzer tag and then I'll
re-index some files.

Thanks for your answer.

On Thu, Oct 27, 2011 at 6:54 PM, Erick Erickson erickerick...@gmail.com wrote:
 Hmmm, I'm not sure what happens when you specify
 <analyzer> (without type="index") and
 <analyzer type="query">. I have no clue which one
 is used.

 Look at the admin/analysis page to understand how things are
 broken up.

 Did you re-index after you added the ngram filter?

 You'll get better help if you include example queries with
 debugQuery=on appended, it'll give us a lot more to
 work with.

 Best
 Erick

 On Wed, Oct 26, 2011 at 4:14 PM, Boris Quiroz boris.qui...@menco.it wrote:
 Hi,

 I've solr running on a CentOS server working OK, but sometimes my 
 application needs to index some parts of a word. For example, if I search 
 'dislike' word fine but if I search 'disl' it returns zero. Also, if I 
 search 'disl*' returns some values (the same if I search for 'dislike') but 
 if I search 'dislike*' it returns zero too.

 So, I've two questions:

 1. How exactly the asterisk works as a wildcard?

 2. What can I do to index properly parts of a word? I added this lines to my 
 schema.xml:

  <fieldType name="text" class="solr.TextField" omitNorms="false">
       <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StandardFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.NGramFilterFactory" minGramSize="2"
  maxGramSize="15"/>
       </analyzer>

       <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StandardFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
  </fieldType>

 But I can't get it to work. Is what I did OK, or am I wrong?

 Thanks.

 --
 Boris Quiroz
 boris.qui...@menco.it






-- 
Boris Quiroz
boris.qui...@menco.it


Re: Collection Distribution vs Replication in Solr

2011-10-28 Thread Alireza Salimi
So I have to ask my question again.
Is there any reason not to use Replication in Solr and use Collection
Distribution?

Thanks

On Thu, Oct 27, 2011 at 5:33 PM, Alireza Salimi alireza.sal...@gmail.com wrote:

 I can't see those benchmarks, can you?

 On Thu, Oct 27, 2011 at 5:20 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Replication is easier to manage and a bit faster. See the performance
 numbers: http://wiki.apache.org/solr/SolrReplication

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Collection-Distribution-vs-Replication-in-Solr-tp3458724p3459178.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 Alireza Salimi
 Java EE Developer





-- 
Alireza Salimi
Java EE Developer


Re: Faceting on multiple fields, with multiple where clauses

2011-10-28 Thread Rubinho
Thank you Erik,
Now I understand the difference between Q and QF.

Unfortunately, there is 1 unsolved problem left (didn't find the answer
yesterday evening).

I added grouping on this query, because I want to show a group of trips with
the same code only once. (A trip has multiple departure days, and I just
want to show 1 trip, while in the detail screen I'll show all the available
trips (departure dates).)

When I don't filter by country, I receive all countries with their correct
count.
When I do filter by country, the count of my countries isn't grouped
anymore.

When I get the number of trips/month, I just get numbers for the next 2
months and no numbers for the other months (the trip should appear in each
month, because they have departures in each).

Can you help me again?
I'll appreciate it very much :)

http://localhost:8080/solr/select?facet=true&facet.date={!ex=SD}StartDate&f.StartDate.facet.date.start=2011-10-01T00:00:00Z&f.StartDate.facet.date.end=2012-09-30T00:00:00Z&f.StartDate.facet.date.gap=%2B1MONTH&facet.field={!ex=CC}CountryCode&rows=0&version=2.2&q=*:*&group=true&group.field=RoundtripgroupCode&group.truncate=true

These parts of the query are added when a selection is made:
fq={!tag=CC}CountryCode:CR
fq={!tag=SD}StartDate:[2011-10-01T00:00:00Z TO 2011-10-31T00:00:00Z]




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Faceting-on-multiple-fields-with-multiple-where-clauses-tp3457432p3460934.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: bbox issue

2011-10-28 Thread Yonik Seeley
Oops, didn't mean for this conversation to leave the mailing lists.

OK, so your lat and lon types were being stored as text but not
indexed (hence no search matches).
A dynamic field of * does tend to hide bugs/problems ;-)

 So should I have another for _latLon?  Would it look like:
 <dynamicField name="*_latLon" type="double" indexed="true" stored="true"/>

Yep.  It shouldn't be stored though (unless you just want to verify
for debugging).

-Yonik
http://www.lucidimagination.com



On Fri, Oct 28, 2011 at 9:35 AM, Christopher Gross cogr...@gmail.com wrote:
 Hi Yonik.

 I never made a dynamicField definition for _latLon ... I was following
 the examples on http://wiki.apache.org/solr/SpatialSearchDev, so I
 just added the field type definition, then the field in the list of
 fields.  I wasn't aware that I had to do anything else.  The only
 dynamic that I have is:
 <dynamicField name="*" type="text" indexed="false" stored="true"
 multiValued="true"/>

 So should I have another for _latLon?  Would it look like:
 <dynamicField name="*_latLon" type="double" indexed="true" stored="true"/>

 -- Chris



 On Fri, Oct 28, 2011 at 9:27 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Oct 28, 2011 at 8:42 AM, Christopher Gross cogr...@gmail.com wrote:
 Hi Yonik.

 I'm having more of a problem now...
 I made the following lines in my schema.xml (in the appropriate places):

  <fieldType name="location" class="solr.LatLonType"
  subFieldSuffix="_latLon"/>

  <field name="point" type="location" indexed="true" stored="true"
  required="false"/>

  I have data (did a q=*:*, found one with a point):
  <str name="point">48.306074,14.286293</str>
  <arr name="point_0_latLon">
  <str>48.306074</str>
  </arr>
  <arr name="point_1_latLon">
  <str>14.286293</str>
  </arr>

  I've tried to do a bbox:
  q=*:*&fq=point:[30.0,10.0%20TO%2050.0,20.0]
  q=*:*&fq={!bbox}&sfield=point&pt=48,14&d=50

 And neither of those seem to find the point...

 Hmmm, what's the dynamicField definition for _latLon?  Is it indexed?
 If you add debugQuery=true, you should be able to see the underlying
 range queries for your explicit range query.

 -Yonik
 http://www.lucidimagination.com




Re: bbox issue

2011-10-28 Thread Christopher Gross
Ah!  That all makes sense.  The example on the SpatialSearchDev page
should have that bit added in!

I'm back in business now, thanks Yonik!

-- Chris



On Fri, Oct 28, 2011 at 9:40 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 Oops, didn't mean for this conversation to leave the mailing lists.

 OK, so your lat and lon types were being stored as text but not
 indexed (hence no search matches).
 A dynamic field of * does tend to hide bugs/problems ;-)

 So should I have another for _latLon?  Would it look like:
 <dynamicField name="*_latLon" type="double" indexed="true" stored="true"/>

 Yep.  It shouldn't be stored though (unless you just want to verify
 for debugging).

 -Yonik
 http://www.lucidimagination.com



 On Fri, Oct 28, 2011 at 9:35 AM, Christopher Gross cogr...@gmail.com wrote:
 Hi Yonik.

 I never made a dynamicField definition for _latLon ... I was following
 the examples on http://wiki.apache.org/solr/SpatialSearchDev, so I
 just added the field type definition, then the field in the list of
 fields.  I wasn't aware that I had to do anything else.  The only
 dynamic that I have is:
  <dynamicField name="*" type="text" indexed="false" stored="true"
  multiValued="true"/>

 So should I have another for _latLon?  Would it look like:
  <dynamicField name="*_latLon" type="double" indexed="true" stored="true"/>

 -- Chris



 On Fri, Oct 28, 2011 at 9:27 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Oct 28, 2011 at 8:42 AM, Christopher Gross cogr...@gmail.com 
 wrote:
 Hi Yonik.

 I'm having more of a problem now...
 I made the following lines in my schema.xml (in the appropriate places):

  <fieldType name="location" class="solr.LatLonType"
  subFieldSuffix="_latLon"/>

  <field name="point" type="location" indexed="true" stored="true"
  required="false"/>

  I have data (did a q=*:*, found one with a point):
  <str name="point">48.306074,14.286293</str>
  <arr name="point_0_latLon">
  <str>48.306074</str>
  </arr>
  <arr name="point_1_latLon">
  <str>14.286293</str>
  </arr>

  I've tried to do a bbox:
  q=*:*&fq=point:[30.0,10.0%20TO%2050.0,20.0]
  q=*:*&fq={!bbox}&sfield=point&pt=48,14&d=50

 And neither of those seem to find the point...

 Hmmm, what's the dynamicField definition for _latLon?  Is it indexed?
 If you add debugQuery=true, you should be able to see the underlying
 range queries for your explicit range query.

 -Yonik
 http://www.lucidimagination.com





Re: solr break up word

2011-10-28 Thread Boris Quiroz
Hi,

I solved the issue. I added to my schema.xml the following lines:

<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3"
maxGramSize="15" />
<filter class="solr.LowerCaseFilterFactory"/>
...
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
...
</analyzer>

Then, I re-index and everything is working great :-)

Thanks for your help.

On Fri, Oct 28, 2011 at 10:08 AM, Boris Quiroz boris.qui...@menco.it wrote:
 Hi Erick,

 I'll try without the type=index on analyzer tag and then I'll
 re-index some files.

 Thanks for your answer.

 On Thu, Oct 27, 2011 at 6:54 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Hmmm, I'm not sure what happens when you specify
 <analyzer> (without type="index") and
 <analyzer type="query">. I have no clue which one
 is used.

 Look at the admin/analysis page to understand how things are
 broken up.

 Did you re-index after you added the ngram filter?

 You'll get better help if you include example queries with
 debugQuery=on appended, it'll give us a lot more to
 work with.

 Best
 Erick

 On Wed, Oct 26, 2011 at 4:14 PM, Boris Quiroz boris.qui...@menco.it wrote:
 Hi,

 I've solr running on a CentOS server working OK, but sometimes my 
 application needs to index some parts of a word. For example, if I search 
 'dislike' word fine but if I search 'disl' it returns zero. Also, if I 
 search 'disl*' returns some values (the same if I search for 'dislike') but 
 if I search 'dislike*' it returns zero too.

 So, I've two questions:

 1. How exactly the asterisk works as a wildcard?

 2. What can I do to index properly parts of a word? I added this lines to 
 my schema.xml:

  <fieldType name="text" class="solr.TextField" omitNorms="false">
       <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StandardFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.NGramFilterFactory" minGramSize="2"
  maxGramSize="15"/>
       </analyzer>

       <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StandardFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
  </fieldType>

  But I can't get it to work. Is what I did OK, or am I wrong?

 Thanks.

 --
 Boris Quiroz
 boris.qui...@menco.it






 --
 Boris Quiroz
 boris.qui...@menco.it




-- 
Boris Quiroz
boris.qui...@menco.it


Updating a document multi-value field (no dup values) without needing it to be already committed

2011-10-28 Thread Thibaut Colar

Sorry for the lengthy text, it's a bit difficult to explain:

We are using Solr to index some user info like username, email (among 
other things).


I'm also trying to use facets for search, so for example, I added a
multi-value field to user called "organizations" where I would store the
name of the organizations that the user works for.


So I can use that field for faceted search and be able to filter a user
search query result by the organizations this user works for.


So now, the issue I have is my code does something like: 1) Add users 
documents to Solr 2) When a user is assigned an organization 
membership(role), update the user doc to set the organizations field


Now I have the following issue with step 2: If I just do an
addField("organizations", "BigCorp") on the user doc, it will add that
value regardless of whether organizations already has that value ("BigCorp")
or not, but I want each org name to appear only once.


So the only way I found to get that behavior is to query the user document,
get the values of organizations, and only add the new value if it's not
already in there: if (!userDoc.getValues("organizations").contains(value))
{ ... add the value to the doc and save it ... }


Now that works well, but only if I commit all the time (between steps 1 &
2 at least), because the document query will not work unless it has been
committed already. Obviously in theory it's best not to commit all the
time performance-wise, and it's impractical since I process those inserts in
batches.


*So I guess the main issue would be:*

 * Is there a way to update a multi-value field, without allowing
   duplicates, that would not require querying the doc to manually
   prevent duplicates?

 * Maybe some better way to do this?

Thanks.
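
(For illustration, the query-then-read-then-add workaround described above might look roughly like this with SolrJ 3.x. "userId" is a hypothetical variable holding the doc's unique key, all fields are assumed to be stored so the doc can be rebuilt, and, as noted, the document must already be committed/visible:

    SolrDocument existing = server.query(new SolrQuery("id:" + userId)).getResults().get(0);
    Collection<Object> orgs = existing.getFieldValues("organizations");
    if (orgs == null || !orgs.contains("BigCorp")) {
        // rebuild the stored document and append the new organization
        // (note: copyField targets get re-populated when the doc is re-added)
        SolrInputDocument update = new SolrInputDocument();
        for (String field : existing.getFieldNames()) {
            for (Object value : existing.getFieldValues(field)) {
                update.addField(field, value);
            }
        }
        update.addField("organizations", "BigCorp");
        server.add(update);
    }
)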



Re: Updating a document multi-value field (no dup values) without needing it to be already committed

2011-10-28 Thread Thibaut Colar

A related question is:
Is there a way to update a doc to remove a specific value from a
multi-value field (in my case, remove a role).


I manage to do that by querying the doc and reading all the other values 
manually then saving, but that has the same issues and is inefficient.


On 10/28/11 10:04 AM, Thibaut Colar wrote:

Sorry for the lengthy text, it's a bit difficult to explain:

We are using Solr to index some user info like username, email (among 
other things).


I'm also trying to use facets for search, so for example, I added a 
multi-value field to user called organizations where I would store 
the name of the organizations that user work for.


So i can use that field for facetted search and be able to filter a 
user search query result by the organizations this user work for.


So now, the issue I have is my code does something like: 1) Add users 
documents to Solr 2) When a user is assigned an organization 
membership(role), update the user doc to set the organizations field


Now I have the following issue with step 2: If I just do an
addField("organizations", "BigCorp") on the user doc, it will add that
value regardless of whether organizations already has that value ("BigCorp")
or not, but I want each org name to appear only once.


So the only way I found to get that behavior is to query the user
document, get the values of organizations, and only add the new value
if it's not already in there:
if (!userDoc.getValues("organizations").contains(value)) { ... add the value
to the doc and save it ... }


Now that works well, but only if I commit all the time(between step 1 
 2 at least), because the document query will not work unless it has 
been committed already. Obviously in theory its best not to commit all 
the time performance-wise, and unpractical since I process those 
inserts in batches.


*So I guess the main issue would be:*

 * Is there a way to update a multi-value field, without allowing
   duplicates, that would not require querying the doc to manually
   prevent duplicates?

 * Maybe some better way to do this?

Thanks.






Recover index

2011-10-28 Thread Frederico Azeiteiro
Hello all,

 

When moving a SOLR index to another instance I lost the files:

segments.gen

segments_xk

 

I have the .cfs file complete.

 

What are my options to recover the data.

Any ideia that I can test?

 

Thank you.



Frederico Azeiteiro

 



Re: Query/Delete performance difference between straight HTTP and SolrJ

2011-10-28 Thread Shawn Heisey

On 10/27/2011 5:56 AM, Michael Sokolov wrote:
From everything you've said, it certainly sounds like a low-level I/O 
problem in the client, not a server slowdown of any sort.  Maybe Perl 
is using the same connection over and over (keep-alive) and Java is 
not.  I really don't know.  One thing I've heard is that 
StreamingUpdateSolrServer (I think that's what it's called) can give 
better throughput for large request batches.  If you're not using 
that, you may be having problems w/closing and re-opening connections?
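
(For reference, the class mentioned above is used roughly like this in SolrJ 3.x; the URL, queue size and thread count are illustrative values only, and "docs" is assumed to be a Collection<SolrInputDocument>:

    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);
    server.add(docs);   // documents are queued and streamed in the background over pooled connections
    server.commit();    // or rely on autoCommit in solrconfig.xml

One known caveat is that it reports errors less directly than CommonsHttpSolrServer, since requests complete asynchronously.)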


I turned off the perl build system and had the Java program take over 
full build duties for both index chains.  It's been designed so one copy 
of the program can keep any number of index chains up to date 
simultaneously.


On the most recently hourly run, the servers without virtualization took 
50 seconds, the servers with virtualization and more memory took only 16 
seconds, so it looks like this problem has nothing to do with SolrJ, 
it's due to the 1000 clause queries actually taking a long time to 
execute.  The 16 second runtime is still longer than the last run by the 
perl program (12 seconds), but I am also executing an index rebuild in 
the build cores on those servers, so I'm not overly concerned by that.


At this point there isn't any way for me to know whether the speedup 
with the old server builds is due to the extra memory (OS disk cache) or 
due to some quirk of virtualization.  I'm really hoping it's due to the 
extra memory, because I really don't want to go back to a virtualized 
environment.  I'll be able to figure it out after I eliminate my current 
bug and complete the migration.


Thank you very much to everyone who offered assistance.  It helped me 
make sure my testing was as unbiased as I could achieve.


Shawn



form-data post to ExtractingRequestHandler with utf-8 characters not handled

2011-10-28 Thread kgoess
I'm trying to post a PDF along with a whole bunch of metadata fields to the
ExtractingRequestHandler as multipart/form-data.   It works fine except for
the utf-8 character handling.  Here is what my post looks like (abridged):

   POST /solr/update/extract HTTP/1.1
   TE: deflate,gzip;q=0.3
   Connection: TE, close
   Host: localhost:8983
   Content-Length: 21418
   Content-Type: multipart/form-data;
boundary=wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   
   --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
    Content-Disposition: form-data; name="literal.title"

   smart ‘ quote
   --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   
    Content-Disposition: form-data; name="myfile";
filename="text.pdf.1174588823"
   Content-Type: application/pdf
   Content-Transfer-Encoding: binary

   ...binary pdf data

I've verified on the network that the quote character, a LEFT SINGLE
QUOTATION MARK (U+2018) is going across the wire as the utf-8 bytes e2 80
98 which is correct.  However, when I search for the document in Solr, it's
coming back as the byte sequence c3 a2 c2 80 c2 98 which I'm guessing is
it being double-utf8-encoded.

The multipart/form-data is MIME, which is supposed to be 7-bit, so I've
tried encoding any non-ascii fields as quoted-printable

    Content-Disposition: form-data; name="literal.title"
   Content-Transfer-Encoding: quoted-printable

   smart =E2=80=98 quote=

as well as base64

    Content-Disposition: form-data; name="literal.title"
   Content-Transfer-Encoding: base64

   c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI=

but what Solr puts in its index is just that value; it's not decoding either
the quoted-printable or the base64.  I've tried encoding the utf-8 values as
HTML entities, but then Solr doesn't unescape them either, and any accented
characters are stored as the HTML entities, not as the unicode characters.

Can anybody give me any pointers as to where I might be going wrong, where
to look for solutions, or any different/better ways to handle this?
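
In case it's useful to compare, one alternative that sidesteps hand-built MIME is to let SolrJ build the request and handle the parameter encoding for you. A rough sketch (SolrJ 3.x, using org.apache.solr.client.solrj.request.ContentStreamUpdateRequest; the id value and file name here are illustrative):

    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("text.pdf"));                    // the binary payload
    req.setParam("literal.id", "doc-1");                  // illustrative unique key
    req.setParam("literal.title", "smart \u2018 quote");  // UTF-8 metadata value
    server.request(req);
    server.commit();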

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3461731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Partial updates?

2011-10-28 Thread mlevy
An ability to update would be extremely useful for us. Different parts of
records sometimes come from different databases, and being able to update
after creation of the Solr index would be extremely useful.

I've made some processes that read a record and add a new field to it. The
most awkward thing is when there's been a CopyField: when the record is read
and re-saved, the copied field causes CopyField to be invoked again.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html
Sent from the Solr - User mailing list archive at Nabble.com.


large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Roman Alekseenkov
Hi everyone,

I'm looking for some help with Solr indexing issues on a large scale.

We are indexing few terabytes/month on a sizeable Solr cluster (8
masters / serving writes, 16 slaves / serving reads). After certain
amount of tuning we got to the point where a single Solr instance can
handle index size of 100GB without much issues, but after that we are
starting to observe noticeable delays on index flush and they are
getting larger. See the attached picture for details, it's done for a
single JVM on a single machine.

We are posting data in 8 threads using javabin format and doing commit
every 5K documents, merge factor 20, and ram buffer size about 384MB.
From the picture it can be seen that a single-threaded index flushing
code kicks in on every commit and blocks all other indexing threads.
The hardware is decent (12 physical / 24 virtual cores per machine)
and it is mostly idle when the index is flushing. Very little CPU
utilization and disk I/O (5%), with the exception of a single CPU
core which actually does index flush (95% CPU, 5% I/O wait).

My questions are:

1) will Solr changes from real-time branch help to resolve these
issues? I was reading
http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
and it looks like we have exactly the same problem

2) what would be the best way to port these (and only these) changes
to 3.4.0? I tried to dig into the branching and revisions, but got
lost quickly. Tried something like svn diff
[…]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
sure if it's even possible to merge these into 3.4.0

3) what would you recommend for production 24/7 use? 3.4.0?

4) is there a workaround that can be used? also, I listed the stack trace below

Thank you!
Roman

P.S. This single index flushing thread spends 99% of all the time in
org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then
the merge seems to go quickly. I looked it up and it looks like the
intent here is deleting old commit points (we are keeping only 1
non-optimized commit point per config). Not sure why is it taking that
long.

pool-2-thread-1 [RUNNABLE] CPU time: 3:31
java.nio.Bits.copyToByteArray(long, Object, long, long)
java.nio.DirectByteBuffer.get(byte[], int, int)
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.TermInfosReader.init(Directory, String,
FieldInfos, int, int)
org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader,
Directory, SegmentInfo, int, int)
org.apache.lucene.index.SegmentReader.get(boolean, Directory,
SegmentInfo, int, boolean, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
boolean, int, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
List)
org.apache.lucene.index.IndexWriter.doFlush(boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask$Sync.innerRun()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Roman Alekseenkov
I'm wondering if this is relevant:
https://issues.apache.org/jira/browse/LUCENE-2680 - Improve how
IndexWriter flushes deletes against existing segments

Roman

On Fri, Oct 28, 2011 at 11:38 AM, Roman Alekseenkov
ralekseen...@gmail.com wrote:
 Hi everyone,

 I'm looking for some help with Solr indexing issues on a large scale.

 We are indexing few terabytes/month on a sizeable Solr cluster (8
 masters / serving writes, 16 slaves / serving reads). After certain
 amount of tuning we got to the point where a single Solr instance can
 handle index size of 100GB without much issues, but after that we are
 starting to observe noticeable delays on index flush and they are
 getting larger. See the attached picture for details, it's done for a
 single JVM on a single machine.

 We are posting data in 8 threads using javabin format and doing commit
 every 5K documents, merge factor 20, and ram buffer size about 384MB.
 From the picture it can be seen that a single-threaded index flushing
 code kicks in on every commit and blocks all other indexing threads.
 The hardware is decent (12 physical / 24 virtual cores per machine)
 and it is mostly idle when the index is flushing. Very little CPU
 utilization and disk I/O (5%), with the exception of a single CPU
 core which actually does index flush (95% CPU, 5% I/O wait).

 My questions are:

 1) will Solr changes from real-time branch help to resolve these
 issues? I was reading
 http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
 and it looks like we have exactly the same problem

 2) what would be the best way to port these (and only these) changes
 to 3.4.0? I tried to dig into the branching and revisions, but got
 lost quickly. Tried something like svn diff
 […]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
 sure if it's even possible to merge these into 3.4.0

 3) what would you recommend for production 24/7 use? 3.4.0?

 4) is there a workaround that can be used? also, I listed the stack trace 
 below

 Thank you!
 Roman

 P.S. This single index flushing thread spends 99% of all the time in
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then
 the merge seems to go quickly. I looked it up and it looks like the
 intent here is deleting old commit points (we are keeping only 1
 non-optimized commit point per config). Not sure why is it taking that
 long.

 pool-2-thread-1 [RUNNABLE] CPU time: 3:31
 java.nio.Bits.copyToByteArray(long, Object, long, long)
 java.nio.DirectByteBuffer.get(byte[], int, int)
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
 int)
 org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
 org.apache.lucene.index.SegmentTermEnum.next()
 org.apache.lucene.index.TermInfosReader.init(Directory, String,
 FieldInfos, int, int)
 org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader,
 Directory, SegmentInfo, int, int)
 org.apache.lucene.index.SegmentReader.get(boolean, Directory,
 SegmentInfo, int, boolean, int)
 org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
 boolean, int, int)
 org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
 List)
 org.apache.lucene.index.IndexWriter.doFlush(boolean)
 org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
 org.apache.lucene.index.IndexWriter.closeInternal(boolean)
 org.apache.lucene.index.IndexWriter.close(boolean)
 org.apache.lucene.index.IndexWriter.close()
 org.apache.solr.update.SolrIndexWriter.close()
 org.apache.solr.update.DirectUpdateHandler2.closeWriter()
 org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
 org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
 java.util.concurrent.Executors$RunnableAdapter.call()
 java.util.concurrent.FutureTask$Sync.innerRun()
 java.util.concurrent.FutureTask.run()
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
 java.util.concurrent.ThreadPoolExecutor$Worker.run()
 java.lang.Thread.run()



Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Simon Willnauer
Hey Roman,

On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
ralekseen...@gmail.com wrote:
 Hi everyone,

 I'm looking for some help with Solr indexing issues on a large scale.

 We are indexing few terabytes/month on a sizeable Solr cluster (8
 masters / serving writes, 16 slaves / serving reads). After certain
 amount of tuning we got to the point where a single Solr instance can
 handle index size of 100GB without much issues, but after that we are
 starting to observe noticeable delays on index flush and they are
 getting larger. See the attached picture for details, it's done for a
 single JVM on a single machine.

 We are posting data in 8 threads using javabin format and doing commit
 every 5K documents, merge factor 20, and ram buffer size about 384MB.
 From the picture it can be seen that a single-threaded index flushing
 code kicks in on every commit and blocks all other indexing threads.
 The hardware is decent (12 physical / 24 virtual cores per machine)
 and it is mostly idle when the index is flushing. Very little CPU
 utilization and disk I/O (5%), with the exception of a single CPU
 core which actually does index flush (95% CPU, 5% I/O wait).

 My questions are:

 1) will Solr changes from real-time branch help to resolve these
 issues? I was reading
 http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
 and it looks like we have exactly the same problem

did you also read http://bit.ly/ujLw6v - here I try to explain the
major difference between Lucene 3.x and 4.0 and why 3.x has these long
idle times. In Lucene 3.x a full flush / commit is a single threaded
process, as you observed there is only one thread making progress. In
Lucene 4 there is still a single thread executing the commit but other
threads are not blocked anymore. Depending on how fast the thread can
flush other threads might help flushing segments for that commit
concurrently or simply index into new documents writers. So basically
4.0 won't have this problem anymore. The realtime branch you talk
about is already merged into 4.0 trunk.


 2) what would be the best way to port these (and only these) changes
 to 3.4.0? I tried to dig into the branching and revisions, but got
 lost quickly. Tried something like svn diff
 […]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
 sure if it's even possible to merge these into 3.4.0

Possible yes! Worth the trouble, I would say no!
DocumentsWriterPerThread (DWPT) is a very big change and I don't think
we should backport this into our stable branch. However, this feature
is very stable in 4.0 though.

 3) what would you recommend for production 24/7 use? 3.4.0?

I think 3.4 is a safe bet! I personally tend to use trunk in
production too; the only problem is that this is basically a moving
target and introduces extra overhead on your side to watch changes and
index format modifications, which could basically prevent you from
doing simple upgrades.


 4) is there a workaround that can be used? also, I listed the stack trace 
 below

 Thank you!
 Roman

 P.S. This single index flushing thread spends 99% of all the time in
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then
 the merge seems to go quickly. I looked it up and it looks like the
 intent here is deleting old commit points (we are keeping only 1
 non-optimized commit point per config). Not sure why is it taking that
 long.

in 3.x there is no way to apply deletes without doing a flush (afaik).
In 3.x a flush means single threaded again - similar to commit just
without syncing files to disk and writing a new segments file. In 4.0
you have way more control over this via
IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
without blocking other threads. In trunk we hijack indexing threads to
do all that work concurrently so you get better cpu utilization and
due to concurrent flushing better and usually continuous IO
utilization.
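
For what it's worth, a minimal sketch of that knob against the 4.0/trunk API mentioned above ("analyzer" and "directory" are assumed to already exist, and the numbers are illustrative, not recommendations):

    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    cfg.setRAMBufferSizeMB(384);
    // apply buffered delete terms once this many have accumulated,
    // instead of only at flush/commit time
    cfg.setMaxBufferedDeleteTerms(10000);
    IndexWriter writer = new IndexWriter(directory, cfg);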

hope that helps.

simon

 pool-2-thread-1 [RUNNABLE] CPU time: 3:31
 java.nio.Bits.copyToByteArray(long, Object, long, long)
 java.nio.DirectByteBuffer.get(byte[], int, int)
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
 int)
 org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
 org.apache.lucene.index.SegmentTermEnum.next()
 org.apache.lucene.index.TermInfosReader.init(Directory, String,
 FieldInfos, int, int)
 org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader,
 Directory, SegmentInfo, int, int)
 org.apache.lucene.index.SegmentReader.get(boolean, Directory,
 SegmentInfo, int, boolean, int)
 org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
 boolean, int, int)
 org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
 List)
 org.apache.lucene.index.IndexWriter.doFlush(boolean)
 org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
 

Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Simon Willnauer
On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 Hey Roman,

 On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
 ralekseen...@gmail.com wrote:
 Hi everyone,

 I'm looking for some help with Solr indexing issues on a large scale.

 We are indexing few terabytes/month on a sizeable Solr cluster (8
 masters / serving writes, 16 slaves / serving reads). After certain
 amount of tuning we got to the point where a single Solr instance can
 handle index size of 100GB without much issues, but after that we are
 starting to observe noticeable delays on index flush and they are
 getting larger. See the attached picture for details, it's done for a
 single JVM on a single machine.

 We are posting data in 8 threads using javabin format and doing commit
 every 5K documents, merge factor 20, and ram buffer size about 384MB.
 From the picture it can be seen that a single-threaded index flushing
 code kicks in on every commit and blocks all other indexing threads.
 The hardware is decent (12 physical / 24 virtual cores per machine)
 and it is mostly idle when the index is flushing. Very little CPU
 utilization and disk I/O (5%), with the exception of a single CPU
 core which actually does index flush (95% CPU, 5% I/O wait).

 My questions are:

 1) will Solr changes from real-time branch help to resolve these
 issues? I was reading
 http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
 and it looks like we have exactly the same problem

 did you also read http://bit.ly/ujLw6v - here I try to explain the
 major difference between Lucene 3.x and 4.0 and why 3.x has these long
 idle times. In Lucene 3.x a full flush / commit is a single threaded
 process, as you observed there is only one thread making progress. In
 Lucene 4 there is still a single thread executing the commit but other
 threads are not blocked anymore. Depending on how fast the thread can
 flush other threads might help flushing segments for that commit
 concurrently or simply index into new documents writers. So basically
 4.0 won't have this problem anymore. The realtime branch you talk
 about is already merged into 4.0 trunk.


 2) what would be the best way to port these (and only these) changes
 to 3.4.0? I tried to dig into the branching and revisions, but got
 lost quickly. Tried something like svn diff
 […]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
 sure if it's even possible to merge these into 3.4.0

 Possible yes! Worth the trouble, I would say no!
 DocumentsWriterPerThread (DWPT) is a very big change and I don't think
 we should backport this into our stable branch. However, this feature
 is very stable in 4.0 though.

 3) what would you recommend for production 24/7 use? 3.4.0?

 I think 3.4 is a safe bet! I personally tend to use trunk in
 production too the only problem is that this is basically a moving
 target and introduces extra overhead on your side to watch changes and
 index format modification which could basically prevent you from
 simple upgrades


 4) is there a workaround that can be used? also, I listed the stack trace 
 below

 Thank you!
 Roman

 P.S. This single index flushing thread spends 99% of all the time in
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then
 the merge seems to go quickly. I looked it up and it looks like the
 intent here is deleting old commit points (we are keeping only 1
 non-optimized commit point per config). Not sure why is it taking that
 long.

 in 3.x there is no way to apply deletes without doing a flush (afaik).
 In 3.x a flush means single threaded again - similar to commit just
 without syncing files to disk and writing a new segments file. In 4.0
 you have way more control over this via
 IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
 without blocking other threads. In trunk we hijack indexing threads to
 do all that work concurrently so you get better cpu utilization and
 due to concurrent flushing better and usually continuous IO
 utilization.

 hope that helps.

 simon

 pool-2-thread-1 [RUNNABLE] CPU time: 3:31
 java.nio.Bits.copyToByteArray(long, Object, long, long)
 java.nio.DirectByteBuffer.get(byte[], int, int)
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
 int)
 org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
 org.apache.lucene.index.SegmentTermEnum.next()
 org.apache.lucene.index.TermInfosReader.init(Directory, String,
 FieldInfos, int, int)
 org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader,
 Directory, SegmentInfo, int, int)
 org.apache.lucene.index.SegmentReader.get(boolean, Directory,
 SegmentInfo, int, boolean, int)
 org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
 boolean, int, int)
 org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
 List)
 

Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
 We should maybe try to fix this in 3.x too?

+1 I suggested it should be backported a while back.  Or that Lucene
4.x should be released.  I'm not sure what is holding up Lucene 4.x at
this point, bulk postings is only needed useful for PFOR.

On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 Hey Roman,

 On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
 ralekseen...@gmail.com wrote:
 Hi everyone,

 I'm looking for some help with Solr indexing issues on a large scale.

 We are indexing few terabytes/month on a sizeable Solr cluster (8
 masters / serving writes, 16 slaves / serving reads). After certain
 amount of tuning we got to the point where a single Solr instance can
 handle index size of 100GB without much issues, but after that we are
 starting to observe noticeable delays on index flush and they are
 getting larger. See the attached picture for details, it's done for a
 single JVM on a single machine.

 We are posting data in 8 threads using javabin format and doing commit
 every 5K documents, merge factor 20, and ram buffer size about 384MB.
 From the picture it can be seen that a single-threaded index flushing
 code kicks in on every commit and blocks all other indexing threads.
 The hardware is decent (12 physical / 24 virtual cores per machine)
 and it is mostly idle when the index is flushing. Very little CPU
 utilization and disk I/O (5%), with the exception of a single CPU
 core which actually does index flush (95% CPU, 5% I/O wait).

 My questions are:

 1) will Solr changes from real-time branch help to resolve these
 issues? I was reading
 http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
 and it looks like we have exactly the same problem

 did you also read http://bit.ly/ujLw6v - here I try to explain the
 major difference between Lucene 3.x and 4.0 and why 3.x has these long
 idle times. In Lucene 3.x a full flush / commit is a single threaded
 process, as you observed there is only one thread making progress. In
 Lucene 4 there is still a single thread executing the commit but other
 threads are not blocked anymore. Depending on how fast the thread can
 flush other threads might help flushing segments for that commit
 concurrently or simply index into new documents writers. So basically
 4.0 won't have this problem anymore. The realtime branch you talk
 about is already merged into 4.0 trunk.


 2) what would be the best way to port these (and only these) changes
 to 3.4.0? I tried to dig into the branching and revisions, but got
 lost quickly. Tried something like svn diff
 […]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
 sure if it's even possible to merge these into 3.4.0

 Possible yes! Worth the trouble, I would say no!
 DocumentsWriterPerThread (DWPT) is a very big change and I don't think
 we should backport this into our stable branch. However, this feature
 is very stable in 4.0 though.

 3) what would you recommend for production 24/7 use? 3.4.0?

 I think 3.4 is a safe bet! I personally tend to use trunk in
 production too the only problem is that this is basically a moving
 target and introduces extra overhead on your side to watch changes and
 index format modification which could basically prevent you from
 simple upgrades


 4) is there a workaround that can be used? also, I listed the stack trace 
 below

 Thank you!
 Roman

 P.S. This single index flushing thread spends 99% of all the time in
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then
 the merge seems to go quickly. I looked it up and it looks like the
 intent here is deleting old commit points (we are keeping only 1
 non-optimized commit point per config). Not sure why is it taking that
 long.

 in 3.x there is no way to apply deletes without doing a flush (afaik).
 In 3.x a flush means single threaded again - similar to commit just
 without syncing files to disk and writing a new segments file. In 4.0
 you have way more control over this via
 IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
 without blocking other threads. In trunk we hijack indexing threads to
 do all that work concurrently so you get better cpu utilization and
 due to concurrent flushing better and usually continuous IO
 utilization.

 hope that helps.

 simon

 pool-2-thread-1 [RUNNABLE] CPU time: 3:31
 java.nio.Bits.copyToByteArray(long, Object, long, long)
 java.nio.DirectByteBuffer.get(byte[], int, int)
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
 int)
 org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
 org.apache.lucene.index.SegmentTermEnum.next()
 org.apache.lucene.index.TermInfosReader.init(Directory, String,
 FieldInfos, int, int)
 org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader,
 Directory, SegmentInfo, int, int)
 

RE: Partial updates?

2011-10-28 Thread Brandon Ramirez
I would love to see this too.  Most of our data comes from a relational 
database, but there are some files on the file system related to our products 
that may need to be indexed.  The files have different change control / life 
cycle, so I can't be sure that our application will know when this data  
changes, so a recurring background re-index job would be helpful.  Having to go 
to the database to get 99% of the data (which didn't change anyway) to send 
along with the 1% from the file system is a big limitation.

This also prevents the use of DIH.


Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848 
Software Engineer II | Element K | www.elementk.com


-Original Message-
From: mlevy [mailto:ml...@ushmm.org] 
Sent: Friday, October 28, 2011 2:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Partial updates?

An ability to update would be extremely useful for us. Different parts of 
records sometimes come from different databases, and being able to update after 
creation of the Solr index would be extremely useful.

I've made some processes that reads a record and adds a new field to it. The 
most awkward thing is when there's been a CopyField, when the record is read 
and re-saved, the copied field causes CopyField to be invoked again.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Robert Muir
On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:

 +1 I suggested it should be backported a while back.  Or that Lucene
 4.x should be released.  I'm not sure what is holding up Lucene 4.x at
 this point, bulk postings is only needed useful for PFOR.

This is not true, most modern index compression schemes, not just
PFOR-delta read more than one integer at a time.

Thats why its important not only to abstract away the encoding of the
index, but to also ensure that the enumeration apis aren't biased
towards one-at-a-time vInt.

Otherwise we have flexible indexing where flexible means slower
if you do anything but the default.

-- 
lucidimagination.com


edismax/boost: certain documents should be last

2011-10-28 Thread Paul
(I am using solr 3.4 and edismax.)

In my index, I have a multivalued field named genre. One of the
values this field can have is Citation. I would like documents that
have a genre field of Citation to always be at the bottom of the
search results.

I've been experimenting, but I can't seem to figure out the syntax of
the search I need. Here is the search that seems most logical to me
(newlines added here for readability):

q=%2bcontent%3Anotes+genre%3ACitation^0.01
start=0
rows=3
fl=genre+title
version=2.2
defType=edismax

I get the same results whether I include genre%3ACitation^0.01 or not.

Just to see if my names were correct, I put a minus sign before
genre and it did, in fact, stop returning all the documents
containing Citation.

What am I doing wrong?

Here are the results from the above query:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">genre title</str>
      <str name="start">0</str>
      <str name="q">+content:notes genre:Citation^0.01</str>
      <str name="rows">3</str>
      <str name="version">2.2</str>
      <str name="defType">edismax</str>
    </lst>
  </lst>
  <result name="response" numFound="1276" start="0">
    <doc>
      <arr name="genre"><str>Citation</str><str>Fiction</str></arr>
      <str name="title">Notes on novelists With some other notes</str>
    </doc>
    <doc>
      <arr name="genre"><str>Citation</str></arr>
      <str name="title">Novel notes</str>
    </doc>
    <doc>
      <arr name="genre"><str>Citation</str></arr>
      <str name="title">Knock about notes</str>
    </doc>
  </result>
</response>
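
(One possibility, offered here as an untested sketch rather than a verified answer: keep the Citation clause out of q entirely and instead add an edismax boost query that lifts every non-Citation document, e.g. bq=(*:* -genre:Citation)^10. With SolrJ that could look like:

    SolrQuery query = new SolrQuery("content:notes");     // the user's terms, as in the example
    query.set("defType", "edismax");
    query.set("bq", "(*:* -genre:Citation)^10");          // push Citation docs toward the bottom
    query.setFields("genre", "title", "score");

Since bq is additive, the boost factor may need tuning if relevancy scores vary widely.)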


i don't get why this says non-match

2011-10-28 Thread Robert Petersen
It looks to me like everything matches down the line but top level says
otherQuery is a non-match... I don't get it?
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">77</int>
  <lst name="params">
    <str name="explainOther">SyncMaster</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">+syncmaster -SyncMaster</str>
    <str name="hl.fl"/>
    <str name="qt">standard</str>
    <str name="wt">standard</str>
    <str name="fq"/>
    <str name="rows">41</str>
    <str name="version">2.2</str>
  </lst>
</lst>
<result name="response" numFound="26" start="0" maxScore="1.6049292">...</result>
<lst name="debug">
  <str name="rawquerystring">+syncmaster -SyncMaster</str>
  <str name="querystring">+syncmaster -SyncMaster</str>
  <str name="parsedquery">+moreWords:syncmaster
-MultiPhraseQuery(moreWords:"sync (master syncmaster)")</str>
  <str name="parsedquery_toString">+moreWords:syncmaster
-moreWords:"sync (master syncmaster)"</str>
  <str name="otherQuery">SyncMaster</str>
  <lst name="explainOther">
    <str name="209730998">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.4043131 = (MATCH) fieldWeight(moreWords:syncmaster in 46710), product of:
    1.4142135 = tf(termFreq(moreWords:syncmaster)=2)
    9.078851 = idf(docFreq=41, maxDocs=135472)
    0.109375 = fieldNorm(field=moreWords, doc=46710)
  0.0 = match on prohibited clause (moreWords:"sync (master syncmaster)")
    9.393997 = (MATCH) weight(moreWords:"sync (master syncmaster)" in 46710), product of:
      2.5863855 = queryWeight(moreWords:"sync (master syncmaster)"), product of:
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.1101461 = queryNorm
      3.6320949 = (MATCH) fieldWeight(moreWords:"sync (master syncmaster)" in 46710), product of:
        1.4142135 = tf(phraseFreq=2.0)
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.109375 = fieldNorm(field=moreWords, doc=46710)
    </str>



Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
 Otherwise we have flexible indexing where flexible means slower
 if you do anything but the default.

The other encodings should exist as modules since they are pluggable.
4.0 can ship with the existing codec.  4.1 with additional codecs and
the bulk postings at a later time.

Otherwise it will be 6 months before 4.0 ships, that's too long.

Also it is an amusing contradiction that your argument flies in the
face of Lucid shipping 4.x today without said functionality.

On Fri, Oct 28, 2011 at 5:09 PM, Robert Muir rcm...@gmail.com wrote:
 On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:

 +1 I suggested it should be backported a while back.  Or that Lucene
 4.x should be released.  I'm not sure what is holding up Lucene 4.x at
 this point, bulk postings is only needed useful for PFOR.

 This is not true, most modern index compression schemes, not just
 PFOR-delta read more than one integer at a time.

 Thats why its important not only to abstract away the encoding of the
 index, but to also ensure that the enumeration apis aren't biased
 towards one-at-a-time vInt.

 Otherwise we have flexible indexing where flexible means slower
 if you do anything but the default.

 --
 lucidimagination.com



Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Robert Muir
On Fri, Oct 28, 2011 at 8:10 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Otherwise we have flexible indexing where flexible means slower
 if you do anything but the default.

 The other encodings should exist as modules since they are pluggable.
 4.0 can ship with the existing codec.  4.1 with additional codecs and
 the bulk postings at a later time.

you don't know what you are talking about:  go look at the source
code. the whole problem is that encodings aren't pluggable.


 Otherwise it will be 6 months before 4.0 ships, that's too long.

sucks for you.


 Also it is an amusing contradiction that your argument flies in the
 face of Lucid shipping 4.x today without said functionality.


No it doesn't. trunk is open source. you can use it, too, if you want.

-- 
lucidimagination.com


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
 abstract away the encoding of the index

Robert, this is what you wrote.  Abstract away the encoding of the
index means pluggable, otherwise it's not abstract and / or it's a
flawed design.  Sounds like it's the latter.


Re: URL Redirect

2011-10-28 Thread prr
Finotti Simone tech178 at yoox.com writes:

 
 Hello,
 
 I have been assigned the task to migrate from Endeca to Solr.
 
 The former engine allowed me to set keyword triggers that, when matched
exactly, caused the web client to
 redirect to a specified URL.
 
 Does that feature exist in Solr? If so, where can I get some info?
 
 Thank you



Hi, I am also looking at migrating from Endeca to Solr, but on first
look it seems extremely tedious to me... please pass on any tips on how to
approach the problem.