Replication failed without an error =(

2012-04-24 Thread stockii
hello..

anyone a idea how i can figure out why my replication failed ? i got no
errors  =(

my configuratio is.

2 server! both are master and slave at the same time. only one server makes
updates and is so the master. on slave is started via cron a replication. is
one server crashed, i can easy switch master to slave, this is because both
are master AND slave at the same time.

this works well but now no replicate is working since i deleted the
pollInterval !?!? is this a reason?

thx

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents other Cores  200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-failed-without-an-error-tp3934655p3934655.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multi-words synonyms matching

2012-04-24 Thread elisabeth benoit
Hello,

I'd like to resume this post.

The only way I found to do not split synonyms in words in synonyms.txt it
to use the line

 filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
ignoreCase=true expand=true
tokenizerFactory=solr.KeywordTokenizerFactory/

in schema.xml

where tokenizerFactory=solr.KeywordTokenizerFactory

instructs SynonymFilterFactory not to break synonyms into words on white
spaces when parsing synonyms file.

So now it works fine, mairie is mapped into hotel de ville and when I
send request q=hotel de ville (quotes are mandatory to prevent analyzer
to split hotel de ville on white spaces), I get answers with word mairie.

But when I use fq parameter (fq=CATEGORY_ANALYZED:hotel de ville), it
doesn't work!!!

CATEGORY_ANALYZED is same field type as default search field. This means
that when I send q=hotel de ville and fq=CATEGORY_ANALYZED:hotel de
ville, solr uses the same analyzer, the one with the line

filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
ignoreCase=true expand=true
tokenizerFactory=solr.KeywordTokenizerFactory/.

Anyone as a clue what is different between q analysis behaviour and fq
analysis behaviour?

Thanks a lot
Elisabeth

2012/4/12 elisabeth benoit elisaelisael...@gmail.com

 oh, that's right.

 thanks a lot,
 Elisabeth


 2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com

 Elisabeth -

 As you described, below mapping might suit for your need.
 mairie = hotel de ville, mairie

 mairie gets expanded to hotel de ville and mairie at index time.  So
 mairie and hotel de ville searchable on document.

 However, still white space tokenizer splits at query time will be a
 problem as described by Markus.

 --Jeevanandam

 On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

  Have you tried the =' mapping instead? Something
  like
  hotel de ville = mairie
  might work for you.
 
  Yes, thanks, I've tried it but from what I undestand it doesn't solve my
  problem, since this means hotel de ville will be replace by mairie at
  index time (I use synonyms only at index time). So when user will ask
  hôtel de ville, it won't match.
 
  In fact, at index time I have mairie in my data, but I want user to be
 able
  to request mairie or hôtel de ville and have mairie as answer, and
 not
  have mairie as an answer when requesting hôtel.
 
 
  To map `mairie` to `hotel de ville` as single token you must escape
 your
  white
  space.
 
  mairie, hotel\ de\ ville
 
  This results in  a problem if your tokenizer splits on white space at
  query
  time.
 
  Ok, I guess this means I have a problem. No simple solution since at
 query
  time my tokenizer do split on white spaces.
 
  I guess my problem is more or less one of the problems discussed in
 
 
 http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
 
 
  Thanks a lot for your answers,
  Elisabeth
 
 
 
 
 
  2012/4/10 Erick Erickson erickerick...@gmail.com
 
  Have you tried the =' mapping instead? Something
  like
  hotel de ville = mairie
  might work for you.
 
  Best
  Erick
 
  On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution
 to
  my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
  filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
  ignoreCase=true expand=true/
 
  The problem I have is that now mairie matches with hotel and I
 would
  only want mairie to match with hotel de ville and mairie.
 
  When I look into the analyzer, I see that mairie is mapped into
  hotel,
  and words de ville are added in second and third position. To change
  that, I tried to do
 
  filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
  ignoreCase=true expand=true
  tokenizerFactory=solr.KeywordTokenizerFactory/ (as I read in one
 post)
 
  and I can see now in the analyzer that mairie is mapped to hotel de
  ville, but now when I have query hotel de ville, it doesn't match
 at
  all
  with mairie.
 
  Anyone has a clue of what I'm doing wrong?
 
  I'm using Solr 3.4.
 
  Thanks,
  Elisabeth
 





Re: Replication failed without an error =(

2012-04-24 Thread stockii
bevore this problem i got this problem
https://issues.apache.org/jira/browse/SOLR-1781

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents other Cores  200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-failed-without-an-error-tp3934655p3934813.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Group by distance

2012-04-24 Thread ravicv
Use group=true and group.field in your query.
And your solr version should be solr 3.4 and above.

Thanks,
Ravi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Group-by-distance-tp3934876p3934886.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Group by distance

2012-04-24 Thread ViruS
I think this only can works when I have many records in same position.
My problem is to group witch short distance... like I say in last mail...
about 10km.
I need put markers on Poland country and display this.
Now I have 100k records, but in future I will have about 2mln records so I
must send grouped records.

Best,
Piotr

On 24 April 2012 12:08, ravicv ravichandra...@gmail.com wrote:

 Use group=true and group.field in your query.
 And your solr version should be solr 3.4 and above.

 Thanks,
 Ravi

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Group-by-distance-tp3934876p3934886.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Piotr (ViruS) Sikora
E-mail/JID: vi...@hostv.pl
http://piotrsikora.pl


Auto suggest on indexed file content filtered based on user

2012-04-24 Thread prakash_ajp
I am trying to implement an auto-suggest feature. The search feature already
exists and searches on file content in user's allotted workspace.

The following is from my schema that will be used for search indexing:

   field name=Text type=text indexed=true stored=false
multiValued=false/
   field name=UserName type=string indexed=true stored=true
multiValued=true/

The search result is filtered by the user name. The suggest is implemented
as a searchComponent and the field 'Text' is used by the suggester and would
have to be filtered the same way the search is done. The problem with this
approach is, suggest works on a single field and there is no way to include
the UserName field as a filter.

What's the best way out from here?

Thanks in advance!
Jay

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3934565.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Deciding whether to stem at query time

2012-04-24 Thread Andrew Wagner
Ah, this is a really good point. Still seems like it has the downsides of
#2, though, much bigger space requirements and possibly some time lost on
queries.

On Mon, Apr 23, 2012 at 3:35 PM, Walter Underwood wun...@wunderwood.orgwrote:

 There is a third approach. Create two fields and always query both of
 them, with the exact field given a higher weight. This works great and
 performs well.

 It is what we did at Netflix and what I'm doing at Chegg.

 wunder

 On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

  So I just realized the other day that stemming basically happens at index
  time. If I'm understanding correctly, there's no way to allow a user to
  specify, at run time, whether to stem particular words or not based on a
  single index. I think there are two options, but I'd love to hear that
 I'm
  wrong:
 
  1.) Incrementally build up a white list of words that don't stem very
 well.
  To pick a random example out of the blue, light isn't super closely
  related to, lighter, so I might choose not to stem that. If I wanted to
  do this, I think (if I understand correctly), stemmerOverrideFilter would
  help me out with this. I'm not a big fan of this approach.
 
  2.) Index all the text in two fields, once with stemming and once
 without.
  Then build some kind of option into the UI for specifying whether to stem
  the words or not, and search the appropriate field. Unfortunately, this
  would roughly double the size of my index, and probably affect query
 times
  too. Plus, the UI would probably suck.
 
  Am I missing an option? Has anyone tried one of these approaches?
 
  Thanks!
  Andrew








Searching on fields with White Spaces

2012-04-24 Thread Shubham Srivastava
I have a custom fieldtype with the below config

fieldType name=text_ngram class=solr.TextField positionIncrementGap=100 
autoGeneratePhraseQueries=true
  analyzer type=index
tokenizer class=solr.WhitespaceTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords.txt enablePositionIncrements=true /
filter class=solr.EdgeNGramFilterFactory minGramSize=1 
maxGramSize=10 /
filter class=solr.PhoneticFilterFactory encoder=DoubleMetaphone 
inject=true/
  /analyzer
  analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory/
filter class=solr.WordDelimiterFilterFactory generateWordParts=1 
generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 
splitOnCaseChange=1/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.EdgeNGramFilterFactory minGramSize=1 
maxGramSize=10 /
  /analyzer
/fieldType


I have an Autocomplete configured on the same field which gives me result as 
expected. A new use case is to search kualalumpur or say newyork with out 
spaces returning Kuala Lumpur and New York which happen to be the original 
values.

What should be the recommended solution.

Regards,
Shubham




Re: Multi-words synonyms matching

2012-04-24 Thread Jeevanandam


usage of q and fq

q = is typically the main query for the search request

fq = is Filter Query; generally used to restrict the super set of 
documents without influencing score (more info. 
http://wiki.apache.org/solr/CommonQueryParameters#q)


For example:

q=hotel de ville === returns 100 documents

q=hotel de villefq=price:[100 To *]fq=roomType:King size Bed === 
returns 40 documents from super set of 100 documents



hope this helps!

- Jeevanandam


On 24-04-2012 3:08 pm, elisabeth benoit wrote:

Hello,

I'd like to resume this post.

The only way I found to do not split synonyms in words in 
synonyms.txt it

to use the line

 filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
ignoreCase=true expand=true
tokenizerFactory=solr.KeywordTokenizerFactory/

in schema.xml

where tokenizerFactory=solr.KeywordTokenizerFactory

instructs SynonymFilterFactory not to break synonyms into words on 
white

spaces when parsing synonyms file.

So now it works fine, mairie is mapped into hotel de ville and 
when I
send request q=hotel de ville (quotes are mandatory to prevent 
analyzer
to split hotel de ville on white spaces), I get answers with word 
mairie.


But when I use fq parameter (fq=CATEGORY_ANALYZED:hotel de ville), 
it

doesn't work!!!

CATEGORY_ANALYZED is same field type as default search field. This 
means
that when I send q=hotel de ville and fq=CATEGORY_ANALYZED:hotel 
de

ville, solr uses the same analyzer, the one with the line

filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
ignoreCase=true expand=true
tokenizerFactory=solr.KeywordTokenizerFactory/.

Anyone as a clue what is different between q analysis behaviour and 
fq

analysis behaviour?

Thanks a lot
Elisabeth

2012/4/12 elisabeth benoit elisaelisael...@gmail.com


oh, that's right.

thanks a lot,
Elisabeth


2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com


Elisabeth -

As you described, below mapping might suit for your need.
mairie = hotel de ville, mairie

mairie gets expanded to hotel de ville and mairie at index 
time.  So

mairie and hotel de ville searchable on document.

However, still white space tokenizer splits at query time will be a
problem as described by Markus.

--Jeevanandam

On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

 Have you tried the =' mapping instead? Something
 like
 hotel de ville = mairie
 might work for you.

 Yes, thanks, I've tried it but from what I undestand it doesn't 
solve my
 problem, since this means hotel de ville will be replace by 
mairie at
 index time (I use synonyms only at index time). So when user will 
ask

 hôtel de ville, it won't match.

 In fact, at index time I have mairie in my data, but I want user 
to be

able
 to request mairie or hôtel de ville and have mairie as 
answer, and

not
 have mairie as an answer when requesting hôtel.


 To map `mairie` to `hotel de ville` as single token you must 
escape

your
 white
 space.

 mairie, hotel\ de\ ville

 This results in  a problem if your tokenizer splits on white 
space at

 query
 time.

 Ok, I guess this means I have a problem. No simple solution since 
at

query
 time my tokenizer do split on white spaces.

 I guess my problem is more or less one of the problems discussed 
in




http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215


 Thanks a lot for your answers,
 Elisabeth





 2012/4/10 Erick Erickson erickerick...@gmail.com

 Have you tried the =' mapping instead? Something
 like
 hotel de ville = mairie
 might work for you.

 Best
 Erick

 On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
 Hello,

 I've read several post on this issue, but can't find a real 
solution

to
 my
 multi-words synonyms matching problem.

 I have in my synonyms.txt an entry like

 mairie, hotel de ville

 and my index time analyzer is configured as followed for 
synonyms.


 filter class=solr.SynonymFilterFactory 
synonyms=synonyms.txt

 ignoreCase=true expand=true/

 The problem I have is that now mairie matches with hotel 
and I

would
 only want mairie to match with hotel de ville and mairie.

 When I look into the analyzer, I see that mairie is mapped 
into

 hotel,
 and words de ville are added in second and third position. To 
change

 that, I tried to do

 filter class=solr.SynonymFilterFactory 
synonyms=synonyms.txt

 ignoreCase=true expand=true
 tokenizerFactory=solr.KeywordTokenizerFactory/ (as I read in 
one

post)

 and I can see now in the analyzer that mairie is mapped to 
hotel de
 ville, but now when I have query hotel de ville, it doesn't 
match

at
 all
 with mairie.

 Anyone has a clue of what I'm doing wrong?

 I'm using Solr 3.4.

 Thanks,
 Elisabeth









Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread Jeevanandam


can you please share a sample query?

-Jeevanandam


On 24-04-2012 1:49 pm, prakash_ajp wrote:
I am trying to implement an auto-suggest feature. The search feature 
already

exists and searches on file content in user's allotted workspace.

The following is from my schema that will be used for search 
indexing:


   field name=Text type=text indexed=true stored=false
multiValued=false/
   field name=UserName type=string indexed=true stored=true
multiValued=true/

The search result is filtered by the user name. The suggest is 
implemented
as a searchComponent and the field 'Text' is used by the suggester 
and would
have to be filtered the same way the search is done. The problem with 
this
approach is, suggest works on a single field and there is no way to 
include

the UserName field as a filter.

What's the best way out from here?

Thanks in advance!
Jay

--
View this message in context:

http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3934565.html
Sent from the Solr - User mailing list archive at Nabble.com.


Recovery - too many updates received since start

2012-04-24 Thread Trym R. Møller

Hi

I experience that a Solr looses its connection with Zookeeper and 
re-establish it. After Solr is reconnection to Zookeeper it begins to 
recover.
It has been missing the connection approximately 10 seconds and 
meanwhile the leader slice has received some documents (maybe about 1000 
documents). Solr fails to update peer sync with the log message:

Apr 21, 2012 10:13:40 AM org.apache.solr.update.PeerSync sync
WARNING: PeerSync: core=mycollection_slice21_shard1 
url=zk-1:2181,zk-2:2181,zk-3:2181 too many updates received since start 
- startingUpdates no longer overlaps with our currentUpdates


Looking into PeerSync and UpdateLog I can see that 100 updates is the 
maximum allowed updates that a shard can be behind.
Is it correct that this is not configurable and what is the reasons for 
choosing 100?


I suspect that one must compare the work needed to replicate the full 
index with the performance loss/resource usage when enhancing the size 
of the UpdateLog?


Any comments regarding this is greatly appreciated.

Best regards Trym


JDBC import yields no data

2012-04-24 Thread Hasan Diwan
I'm trying to migrate from RDBMS to the Lucene ecosystem. To do this, I'm
trying to use the JDBC importer[1]. My configuration is given below:
dataConfig
  dataSource driver=net.sf.log4jdbc.DriverSpy user=sa
url=jdbc:log4jdbc:h2:tcp://192.168.1.6/finance/
  !-- dataSource driver=org.h2.Driver url=jdbc:h2:tcp://
192.168.1.6/finance user=sa / --
document
entity name=receipt query=SELECT 'transaction' as type,
currency, name, amount, done_on from receipts join app_users on user_id =
app_users.id
  deltaQuery=SELECT 'transaction' as type, name, currency, amount,
done_on from receipts join app_users on user_id = app_users.id where
done_on  '${dataimporter.last_index_time}'
field column=NAME name=name /
field column=NAME name=nameSort /
field column=NAME name=alphaNameSort /
field column=AMOUNT name=amount / !-- currencyField not
available till 3.6 --
field column=transaction_time name=done_on / !-- resolve
epoch time --
field column=location name=location/ !-- geospatial?? --
/entity
/document
/dataConfig
And the resulting query of *:*:
% curl http://192.168.1.6:8995/solr/db/select/?q=*%3A*;

   [~]
?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint
name=QTime1/intlst name=paramsstr
name=q*:*/str/lst/lstresult name=response numFound=0
start=0/
/response
The SQL query does work properly, the relevant jars are in the lib
subdirectory. Help? -- H
-- 
Sent from my mobile device
Envoyait de mon portable
1. http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource


Recover - Read timed out

2012-04-24 Thread Trym R. Møller

Hi

I experience that a Solr looses its connection with Zookeeper and 
re-establish it. After Solr is reconnection to Zookeeper it begins to 
recover its replicas. It has been missing the connection approximately 
10 seconds and meanwhile the leader slice has received some documents 
(maybe about 1000 documents). Solr fails to update using peer sync and 
fails afterwards to do a full replicate with the log message below. The 
Solr from where the documents are replicated doesn't log anything when 
the replication is in progress. The full replica continues to fail with 
the read time out for about 10 hours and then Solr gives up.


1. How can I get more information about why the Read time out happens?
2. It seems like the Solr from where it replicates leaks a http 
connection each time (and a thread) having about 18.000 threads in 8 hours.


Any comments are welcome.

Best regards Trym

Apr 21, 2012 10:14:11 AM org.apache.solr.common.SolrException log
SEVERE: Error while trying to 
recover:org.apache.solr.client.solrj.SolrServerException: 
http://solr-ip:8983/solr/mycollection_slice21_shard2
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:493)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:264)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:103)
at 
org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:180)
at 
org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:156)
at 
org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:170)
at 
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:120)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:341)
at 
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:206)

Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at 
org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at 
org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at 
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at 
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:440)

... 8 more


Re: Multi-words synonyms matching

2012-04-24 Thread elisabeth benoit
yes, thanks, but this is NOT my question.

I was wondering why I have multiple matches with q=hotel de ville and no
match with fq=CATEGORY_ANALYZED:hotel de ville, since in both case I'm
searching in the same solr fieldType.

Why is q parameter behaving differently in that case? Why do the quotes
work in one case and not in the other?

Does anyone know?

Thanks,
Elisabeth

2012/4/24 Jeevanandam je...@myjeeva.com


 usage of q and fq

 q = is typically the main query for the search request

 fq = is Filter Query; generally used to restrict the super set of
 documents without influencing score (more info.
 http://wiki.apache.org/solr/**CommonQueryParameters#qhttp://wiki.apache.org/solr/CommonQueryParameters#q
 )

 For example:
 
 q=hotel de ville === returns 100 documents

 q=hotel de villefq=price:[100 To *]fq=roomType:King size Bed ===
 returns 40 documents from super set of 100 documents


 hope this helps!

 - Jeevanandam



 On 24-04-2012 3:08 pm, elisabeth benoit wrote:

 Hello,

 I'd like to resume this post.

 The only way I found to do not split synonyms in words in synonyms.txt it
 to use the line

  filter class=solr.**SynonymFilterFactory synonyms=synonyms.txt
 ignoreCase=true expand=true
 tokenizerFactory=solr.**KeywordTokenizerFactory/

 in schema.xml

 where tokenizerFactory=solr.**KeywordTokenizerFactory

 instructs SynonymFilterFactory not to break synonyms into words on white
 spaces when parsing synonyms file.

 So now it works fine, mairie is mapped into hotel de ville and when I
 send request q=hotel de ville (quotes are mandatory to prevent analyzer
 to split hotel de ville on white spaces), I get answers with word
 mairie.

 But when I use fq parameter (fq=CATEGORY_ANALYZED:hotel de ville), it
 doesn't work!!!

 CATEGORY_ANALYZED is same field type as default search field. This means
 that when I send q=hotel de ville and fq=CATEGORY_ANALYZED:hotel de
 ville, solr uses the same analyzer, the one with the line

 filter class=solr.**SynonymFilterFactory synonyms=synonyms.txt
 ignoreCase=true expand=true
 tokenizerFactory=solr.**KeywordTokenizerFactory/.

 Anyone as a clue what is different between q analysis behaviour and fq
 analysis behaviour?

 Thanks a lot
 Elisabeth

 2012/4/12 elisabeth benoit elisaelisael...@gmail.com

  oh, that's right.

 thanks a lot,
 Elisabeth


 2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com

  Elisabeth -

 As you described, below mapping might suit for your need.
 mairie = hotel de ville, mairie

 mairie gets expanded to hotel de ville and mairie at index time.  So
 mairie and hotel de ville searchable on document.

 However, still white space tokenizer splits at query time will be a
 problem as described by Markus.

 --Jeevanandam

 On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

  Have you tried the =' mapping instead? Something
  like
  hotel de ville = mairie
  might work for you.
 
  Yes, thanks, I've tried it but from what I undestand it doesn't solve
 my
  problem, since this means hotel de ville will be replace by mairie at
  index time (I use synonyms only at index time). So when user will ask
  hôtel de ville, it won't match.
 
  In fact, at index time I have mairie in my data, but I want user to be
 able
  to request mairie or hôtel de ville and have mairie as answer, and
 not
  have mairie as an answer when requesting hôtel.
 
 
  To map `mairie` to `hotel de ville` as single token you must escape
 your
  white
  space.
 
  mairie, hotel\ de\ ville
 
  This results in  a problem if your tokenizer splits on white space
 at
  query
  time.
 
  Ok, I guess this means I have a problem. No simple solution since at
 query
  time my tokenizer do split on white spaces.
 
  I guess my problem is more or less one of the problems discussed in
 
 

 http://lucene.472066.n3.**nabble.com/Multi-word-**
 synonyms-td3716292.html#**a3717215http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
 
 
  Thanks a lot for your answers,
  Elisabeth
 
 
 
 
 
  2012/4/10 Erick Erickson erickerick...@gmail.com
 
  Have you tried the =' mapping instead? Something
  like
  hotel de ville = mairie
  might work for you.
 
  Best
  Erick
 
  On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution
 to
  my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
  filter class=solr.**SynonymFilterFactory synonyms=synonyms.txt
  ignoreCase=true expand=true/
 
  The problem I have is that now mairie matches with hotel and I
 would
  only want mairie to match with hotel de ville and mairie.
 
  When I look into the analyzer, I see that mairie is mapped into
  hotel,
  and words de ville are added in second and third position. To
 change
  that, I tried to do
 
  filter 

debugging junit test with eclipse

2012-04-24 Thread Bernd Fehling
I have tried all hints from internet for debugging a junit test of
solr 3.6 under eclipse but didn't succeed.

eclipse and everything is running, compiling, debugging with runjettyrun.
Tests have no errors.
Ant from command line ist also running with ivy, e.g.
ant -Dtestmethod=testUserFields -Dtestcase=TestExtendedDismaxParser 
test-solr-core

But I can't get a single test with junit running from eclipse and then
jump into it for debugging.

Any idea what's going wrong?

Regards
Bernd


Re: Deciding whether to stem at query time

2012-04-24 Thread Otis Gospodnetic
Hi Andrew,

This would not necessarily increase the size of your index that much - you 
don't to store both fields, just 1 of them if you really need it for 
highlighting or displaying.  If not, just index.

Otis 

Performance Monitoring for Solr - 
http://sematext.com/spm/solr-performance-monitoring




 From: Andrew Wagner wagner.and...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Tuesday, April 24, 2012 7:21 AM
Subject: Re: Deciding whether to stem at query time
 
Ah, this is a really good point. Still seems like it has the downsides of
#2, though, much bigger space requirements and possibly some time lost on
queries.

On Mon, Apr 23, 2012 at 3:35 PM, Walter Underwood wun...@wunderwood.orgwrote:

 There is a third approach. Create two fields and always query both of
 them, with the exact field given a higher weight. This works great and
 performs well.

 It is what we did at Netflix and what I'm doing at Chegg.

 wunder

 On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

  So I just realized the other day that stemming basically happens at index
  time. If I'm understanding correctly, there's no way to allow a user to
  specify, at run time, whether to stem particular words or not based on a
  single index. I think there are two options, but I'd love to hear that
 I'm
  wrong:
 
  1.) Incrementally build up a white list of words that don't stem very
 well.
  To pick a random example out of the blue, light isn't super closely
  related to, lighter, so I might choose not to stem that. If I wanted to
  do this, I think (if I understand correctly), stemmerOverrideFilter would
  help me out with this. I'm not a big fan of this approach.
 
  2.) Index all the text in two fields, once with stemming and once
 without.
  Then build some kind of option into the UI for specifying whether to stem
  the words or not, and search the appropriate field. Unfortunately, this
  would roughly double the size of my index, and probably affect query
 times
  too. Plus, the UI would probably suck.
 
  Am I missing an option? Has anyone tried one of these approaches?
 
  Thanks!
  Andrew










Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Mindaugas Žakšauskas
Hi,

I maintain a distributed system which Solr is part of. The data which
is kept is Solr is permissioned and permissions are currently
implemented by taking the original user query, adding certain bits to
it which would make it return less data in the search results. Now I
am at the point where I need to go over this functionality and try to
improve it.

Changing this to send two separate queries (q=...fq=...) would be the
first logical thing to do, however I was thinking of an extra
improvement. Instead of generating filter query, converting it into a
String, sending over the HTTP just to parse it by Solr again - would
it not be better to take generated Lucene fq query, serialize it using
Java serialization, convert it to, say, Base64 and then send and
deserialize it on the Solr end? Has anyone tried doing any performance
comparisons on this topic?

I am particularly concerned about this because in extreme cases my
filter queries can be very large (1000s of characters long) and we
already had to do tweaks as the size of GET requests would exceed
default limits. And yes, we could move to POST but I would like to
minimize both the amount of data that is sent over and the time taken
to parse large queries.

Thanks in advance.

m.


Re: Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Benson Margulies
2012/4/24 Mindaugas Žakšauskas min...@gmail.com:
 Hi,

 I maintain a distributed system which Solr is part of. The data which
 is kept is Solr is permissioned and permissions are currently
 implemented by taking the original user query, adding certain bits to
 it which would make it return less data in the search results. Now I
 am at the point where I need to go over this functionality and try to
 improve it.

 Changing this to send two separate queries (q=...fq=...) would be the
 first logical thing to do, however I was thinking of an extra
 improvement. Instead of generating filter query, converting it into a
 String, sending over the HTTP just to parse it by Solr again - would
 it not be better to take generated Lucene fq query, serialize it using
 Java serialization, convert it to, say, Base64 and then send and
 deserialize it on the Solr end? Has anyone tried doing any performance
 comparisons on this topic?

I'm about to try out a contribution for serializing queries in
Javascript using Jackson. I've previously done this by serializing my
own data structure and putting the JSON into a custom query parameter.



 I am particularly concerned about this because in extreme cases my
 filter queries can be very large (1000s of characters long) and we
 already had to do tweaks as the size of GET requests would exceed
 default limits. And yes, we could move to POST but I would like to
 minimize both the amount of data that is sent over and the time taken
 to parse large queries.

 Thanks in advance.

 m.


Re: Deciding whether to stem at query time

2012-04-24 Thread Paul Libbrecht

Le 24 avr. 2012 à 17:16, Otis Gospodnetic a écrit :
 This would not necessarily increase the size of your index that much - you 
 don't to store both fields, just 1 of them if you really need it for 
 highlighting or displaying.  If not, just index.

I second this.
The query expansion process is far from being a slow thing... you can easily 
expand to tens of fields with a fairly small penalty.

Where you have a penalty is at stored fields... these need to be really 
carefully avoided as much as possible.
As long as you keep them small, the legendary performance of SOLR will still 
hold.

paul

Re: Deciding whether to stem at query time

2012-04-24 Thread Andrew Wagner
I'm sorry, I'm missing something. What's the difference between storing
and indexing a field?

On Tue, Apr 24, 2012 at 10:28 AM, Paul Libbrecht p...@hoplahup.net wrote:


 Le 24 avr. 2012 à 17:16, Otis Gospodnetic a écrit :
  This would not necessarily increase the size of your index that much -
 you don't to store both fields, just 1 of them if you really need it for
 highlighting or displaying.  If not, just index.

 I second this.
 The query expansion process is far from being a slow thing... you can
 easily expand to tens of fields with a fairly small penalty.

 Where you have a penalty is at stored fields... these need to be really
 carefully avoided as much as possible.
 As long as you keep them small, the legendary performance of SOLR will
 still hold.

 paul


Re: Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Mindaugas Žakšauskas
On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies bimargul...@gmail.com wrote:
 I'm about to try out a contribution for serializing queries in
 Javascript using Jackson. I've previously done this by serializing my
 own data structure and putting the JSON into a custom query parameter.

Thanks for your reply. Appreciate your effort, but I'm not sure if I
fully understand the gain.

Having data in JSON would still require it to be converted into Lucene
Query at the end which takes space  CPU effort, right? Or are you
saying that having query serialized into a structured data blob (JSON
in this case) makes it somehow easier to convert it into Lucene Query?

I only thought about Java serialization because:
- it's rather close to the in-object format
- the mechanism is rather stable and is an established standard in Java/JVM
- Lucene Queries seem to implement java.io.Serializable (haven't done
a thorough check but looks good on the surface)
- other conversions (e.g. using Xtream) are either slow or require
custom annotations. I personally don't see how would Lucene/Solr
include them in their core classes.

Anyway, it would still be interesting to hear if anyone could
elaborate on query parsing complexity.

m.


RE: JDBC import yields no data

2012-04-24 Thread Dyer, James
You might also want to show us your dataimport handler configuration from 
solrconfig.xml and also the url you're using to start the data import.  When 
its complete, browsing to http://192.168.1.6:8995/solr/db/dataimport; (or 
whatever the DIH handler name is in your config) should say indexing complete 
and also the number of documents it imported.  Also, if you have commit=false 
in your config, it won't issue a commit so you won't see the documents.

If it fails, your servlet container's logs should have a stack trace or 
something indicating what the failure was.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Hasan Diwan [mailto:hasan.di...@gmail.com] 
Sent: Tuesday, April 24, 2012 8:51 AM
To: solr-user@lucene.apache.org
Subject: JDBC import yields no data

I'm trying to migrate from RDBMS to the Lucene ecosystem. To do this, I'm
trying to use the JDBC importer[1]. My configuration is given below:
dataConfig
  dataSource driver=net.sf.log4jdbc.DriverSpy user=sa
url=jdbc:log4jdbc:h2:tcp://192.168.1.6/finance/
  !-- dataSource driver=org.h2.Driver url=jdbc:h2:tcp://
192.168.1.6/finance user=sa / --
document
entity name=receipt query=SELECT 'transaction' as type,
currency, name, amount, done_on from receipts join app_users on user_id =
app_users.id
  deltaQuery=SELECT 'transaction' as type, name, currency, amount,
done_on from receipts join app_users on user_id = app_users.id where
done_on  '${dataimporter.last_index_time}'
field column=NAME name=name /
field column=NAME name=nameSort /
field column=NAME name=alphaNameSort /
field column=AMOUNT name=amount / !-- currencyField not
available till 3.6 --
field column=transaction_time name=done_on / !-- resolve
epoch time --
field column=location name=location/ !-- geospatial?? --
/entity
/document
/dataConfig
And the resulting query of *:*:
% curl http://192.168.1.6:8995/solr/db/select/?q=*%3A*;

   [~]
?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint
name=QTime1/intlst name=paramsstr
name=q*:*/str/lst/lstresult name=response numFound=0
start=0/
/response
The SQL query does work properly, the relevant jars are in the lib
subdirectory. Help? -- H
-- 
Sent from my mobile device
Envoyait de mon portable
1. http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource


Re: Group by distance

2012-04-24 Thread Erick Erickson
What do you mean by grouped? It's relatively easy to return
only documents within a certain radius, and it's also easy to
return the results ordered by distance.

Here's a good place to start:
http://wiki.apache.org/solr/SpatialSearch#geofilt_-_The_distance_filter

Best
Erick

On Tue, Apr 24, 2012 at 6:33 AM, ViruS svi...@gmail.com wrote:
 I think this only can works when I have many records in same position.
 My problem is to group witch short distance... like I say in last mail...
 about 10km.
 I need put markers on Poland country and display this.
 Now I have 100k records, but in future I will have about 2mln records so I
 must send grouped records.

 Best,
 Piotr

 On 24 April 2012 12:08, ravicv ravichandra...@gmail.com wrote:

 Use group=true and group.field in your query.
 And your solr version should be solr 3.4 and above.

 Thanks,
 Ravi

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Group-by-distance-tp3934876p3934886.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 Piotr (ViruS) Sikora
 E-mail/JID: vi...@hostv.pl
 http://piotrsikora.pl


Stats Component and solrj

2012-04-24 Thread Erik Fäßler
Hey all,

I'd like to know how many terms I have in a particular field in a search. In 
other words, I want to know how many facets I have in that field. I use string 
fields, there are no numbers. I wanted to use the Stats Component and use its 
count value. When trying this out in the browser, everything works like 
expected.
However, when I want to do the same thing in my Java web app, I get an error 
because in FieldStatsInfo.class it says

 min = (Double)entry.getValue();

Where 'entry.getValue()' is a String because I have a string field here. Thus, 
I get an error that String cannot be cast to Double.
In the browser I just got a String returned here, probably relative to an 
lexicographical order.

I switched the Stats Component on with

query.setGetFieldStatistics(authors);

Where 'authors' is a field with author names.
Is it possible that solrj not yet works with the Stats Component on string 
fields? I tried Solr 3.5 and 3.6 without success. Is there another easy way to 
get the count I want? Will solrj be fixed? Or am I just doing an error?

Best regards,

Erik

correct location in chain for EdgeNGramFilterFactory ?

2012-04-24 Thread geeky2
hello all,

i want to experiment with the EdgeNGramFilterFactory at index time.

i believe this needs to go in post tokenization - but i am doing a pattern
replace as well as other things.

should the EdgeNGramFilterFactory go in right after the pattern replace?




fieldType name=text_en_splitting class=solr.TextField
positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.WhitespaceTokenizerFactory/


filter class=solr.StopFilterFactory ignoreCase=true
words=stopwords.txt enablePositionIncrements=true/
filter class=solr.PatternReplaceFilterFactory pattern=\.
replacement= replace=all/

*put EdgeNGramFilterFactory here === ?*

filter class=solr.WordDelimiterFilterFactory
generateWordParts=1 generateNumberParts=1 catenateWords=1
catenateNumbers=1 catenateAll=1 splitOnCaseChange=1
preserveOriginal=1/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.KeywordMarkerFilterFactory
protected=protwords.txt/
filter class=solr.PorterStemFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory/
filter class=solr.StopFilterFactory ignoreCase=true
words=stopwords.txt enablePositionIncrements=true/
filter class=solr.PatternReplaceFilterFactory pattern=\.
replacement= replace=all/
filter class=solr.WordDelimiterFilterFactory
generateWordParts=1 generateNumberParts=1 catenateWords=0
catenateNumbers=0 catenateAll=0 splitOnCaseChange=1
preserveOriginal=1/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.KeywordMarkerFilterFactory
protected=protwords.txt/
filter class=solr.PorterStemFilterFactory/
  /analyzer
/fieldType

thanks for any help,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/correct-location-in-chain-for-EdgeNGramFilterFactory-tp3935589p3935589.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multi-words synonyms matching

2012-04-24 Thread Erick Erickson
Elisabeth:

What shows up in the debug section of the response when you add
debugQuery=on? There should be some bit of that section like:
parsed_filter_queries

My other question is are you absolutely sure that your
CATEGORY_ANALYZED field has the correct content?. How does it
get populated?

Nothing jumps out at me here

Best
Erick

On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
elisaelisael...@gmail.com wrote:
 yes, thanks, but this is NOT my question.

 I was wondering why I have multiple matches with q=hotel de ville and no
 match with fq=CATEGORY_ANALYZED:hotel de ville, since in both case I'm
 searching in the same solr fieldType.

 Why is q parameter behaving differently in that case? Why do the quotes
 work in one case and not in the other?

 Does anyone know?

 Thanks,
 Elisabeth

 2012/4/24 Jeevanandam je...@myjeeva.com


 usage of q and fq

 q = is typically the main query for the search request

 fq = is Filter Query; generally used to restrict the super set of
 documents without influencing score (more info.
 http://wiki.apache.org/solr/**CommonQueryParameters#qhttp://wiki.apache.org/solr/CommonQueryParameters#q
 )

 For example:
 
 q=hotel de ville === returns 100 documents

 q=hotel de villefq=price:[100 To *]fq=roomType:King size Bed ===
 returns 40 documents from super set of 100 documents


 hope this helps!

 - Jeevanandam



 On 24-04-2012 3:08 pm, elisabeth benoit wrote:

 Hello,

 I'd like to resume this post.

 The only way I found to do not split synonyms in words in synonyms.txt it
 to use the line

  filter class=solr.**SynonymFilterFactory synonyms=synonyms.txt
 ignoreCase=true expand=true
 tokenizerFactory=solr.**KeywordTokenizerFactory/

 in schema.xml

 where tokenizerFactory=solr.**KeywordTokenizerFactory

 instructs SynonymFilterFactory not to break synonyms into words on white
 spaces when parsing synonyms file.

 So now it works fine, mairie is mapped into hotel de ville and when I
 send request q=hotel de ville (quotes are mandatory to prevent analyzer
 to split hotel de ville on white spaces), I get answers with word
 mairie.

 But when I use fq parameter (fq=CATEGORY_ANALYZED:hotel de ville), it
 doesn't work!!!

 CATEGORY_ANALYZED is same field type as default search field. This means
 that when I send q=hotel de ville and fq=CATEGORY_ANALYZED:hotel de
 ville, solr uses the same analyzer, the one with the line

 filter class=solr.**SynonymFilterFactory synonyms=synonyms.txt
 ignoreCase=true expand=true
 tokenizerFactory=solr.**KeywordTokenizerFactory/.

 Anyone as a clue what is different between q analysis behaviour and fq
 analysis behaviour?

 Thanks a lot
 Elisabeth

 2012/4/12 elisabeth benoit elisaelisael...@gmail.com

  oh, that's right.

 thanks a lot,
 Elisabeth


 2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com

  Elisabeth -

 As you described, below mapping might suit for your need.
 mairie = hotel de ville, mairie

 mairie gets expanded to hotel de ville and mairie at index time.  So
 mairie and hotel de ville searchable on document.

 However, still white space tokenizer splits at query time will be a
 problem as described by Markus.

 --Jeevanandam

 On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

  Have you tried the =' mapping instead? Something
  like
  hotel de ville = mairie
  might work for you.
 
  Yes, thanks, I've tried it but from what I undestand it doesn't solve
 my
  problem, since this means hotel de ville will be replace by mairie at
  index time (I use synonyms only at index time). So when user will ask
  hôtel de ville, it won't match.
 
  In fact, at index time I have mairie in my data, but I want user to be
 able
  to request mairie or hôtel de ville and have mairie as answer, and
 not
  have mairie as an answer when requesting hôtel.
 
 
  To map `mairie` to `hotel de ville` as single token you must escape
 your
  white
  space.
 
  mairie, hotel\ de\ ville
 
  This results in  a problem if your tokenizer splits on white space
 at
  query
  time.
 
  Ok, I guess this means I have a problem. No simple solution since at
 query
  time my tokenizer do split on white spaces.
 
  I guess my problem is more or less one of the problems discussed in
 
 

 http://lucene.472066.n3.**nabble.com/Multi-word-**
 synonyms-td3716292.html#**a3717215http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
 
 
  Thanks a lot for your answers,
  Elisabeth
 
 
 
 
 
  2012/4/10 Erick Erickson erickerick...@gmail.com
 
  Have you tried the =' mapping instead? Something
  like
  hotel de ville = mairie
  might work for you.
 
  Best
  Erick
 
  On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution
 to
  my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
  

Re: Deciding whether to stem at query time

2012-04-24 Thread Erick Erickson
When you set store=true in your schema, a verbatim copy of
the raw input is placed in the *.fdt file. That is the information
returned when you specify the fl parameter for instance.

When you set index=true, the input is analyzed and the
resulting terms are placed in the inverted index and are
searchable.

The two are essentially completely orthogonal for all you
specify them at the same time.

So, a field that's stored but not indexed would be displayable
to the user, but no searches could be performed on it.

A field indexed but stored can be searched, but the information
is not retrievable.

Why are there two options? Well, you may use copyField to
index the data two different ways for two different purposes, as
in this thread. Putting the verbatim data in twice is wasteful,
you only ever need it once.

Why store in the first palce? Because all that gets into the
inverted index is the results of the analysis. So if you indexed
story with stemming turned on, it might result in stori being
in the index. And if you use phonetic filters, it's much worse,
your terms will be something like UNT4 or KMPT which are
totally unsuitable to show the user. So if you want to _search_
phonetically but display the field to the user, you would both
index and store.

And even if you could recover the terms from the inverted
index as they were fed in, it would be a very expensive
process. Luke does this, you might try reconstructing
a document with Luke to see what a reconstructed doc
looks like, and how long it takes.

Hope that helps
Erick

On Tue, Apr 24, 2012 at 10:40 AM, Andrew Wagner wagner.and...@gmail.com wrote:
 I'm sorry, I'm missing something. What's the difference between storing
 and indexing a field?

 On Tue, Apr 24, 2012 at 10:28 AM, Paul Libbrecht p...@hoplahup.net wrote:


 Le 24 avr. 2012 à 17:16, Otis Gospodnetic a écrit :
  This would not necessarily increase the size of your index that much -
 you don't to store both fields, just 1 of them if you really need it for
 highlighting or displaying.  If not, just index.

 I second this.
 The query expansion process is far from being a slow thing... you can
 easily expand to tens of fields with a fairly small penalty.

 Where you have a penalty is at stored fields... these need to be really
 carefully avoided as much as possible.
 As long as you keep them small, the legendary performance of SOLR will
 still hold.

 paul


Re: Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Erick Erickson
In general, query parsing is such a small fraction of the total time that,
almost no matter how complex, it's not worth worrying about. To see
this, attach debugQuery=on to your query and look at the timings
in the pepare and process portions of the response. I'd  be
very sure that it was a problem before spending any time trying to make
the transmission of the data across the wire more efficient, my first
reaction is that this is premature optimization.

Second, you could do this on the server side with a custom query
component if you chose. You can freely modify the query
over there and it may make sense in your situation.

Third, consider no cache filters, which were developed for
expensive filter queries, ACL being one of them. See:
https://issues.apache.org/jira/browse/SOLR-2429

Fourth, I'd ask if there's a way to reduce the size of the FQ
clause. Is this on a particular user basis or groups basis?
If you can get this down to a few groups that would help. Although
there's often some outlier who is member of thousands of
groups :(.

Best
Erick


2012/4/24 Mindaugas Žakšauskas min...@gmail.com:
 On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I'm about to try out a contribution for serializing queries in
 Javascript using Jackson. I've previously done this by serializing my
 own data structure and putting the JSON into a custom query parameter.

 Thanks for your reply. Appreciate your effort, but I'm not sure if I
 fully understand the gain.

 Having data in JSON would still require it to be converted into Lucene
 Query at the end which takes space  CPU effort, right? Or are you
 saying that having query serialized into a structured data blob (JSON
 in this case) makes it somehow easier to convert it into Lucene Query?

 I only thought about Java serialization because:
 - it's rather close to the in-object format
 - the mechanism is rather stable and is an established standard in Java/JVM
 - Lucene Queries seem to implement java.io.Serializable (haven't done
 a thorough check but looks good on the surface)
 - other conversions (e.g. using Xtream) are either slow or require
 custom annotations. I personally don't see how would Lucene/Solr
 include them in their core classes.

 Anyway, it would still be interesting to hear if anyone could
 elaborate on query parsing complexity.

 m.


Re: solr replication failing with error: Master at: is not available. Index fetch failed

2012-04-24 Thread geeky2
hello,

thank you for the reply,

yes - master has been indexed.

ok - makes sense - the polling interval needs to change

i did check the solr war file on both boxes (master and slave).  they are
identical.  actually - if they were not indentical - this would point to a
different issue altogether - since our deployment infrastructure - rolls the
war file to the slaves when you do a deployment on the master.

this has me stumped - not sure what to check next.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-replication-failing-with-error-Master-at-is-not-available-Index-fetch-failed-tp3932921p3935699.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Mindaugas Žakšauskas
Hi Erick,

Thanks for looking into this and for the tips you've sent.

I am leaning towards custom query component at the moment, the primary
reason for it would be to be able to squeeze the amount of data that
is sent over to Solr. A single round trip within the same datacenter
is worth around 0.5 ms [1] and if query doesn't fit into a single
ethernet packet, this number effectively has to double/triple/etc.

Regarding cache filters - I was actually thinking the opposite:
caching ACL queries (filter queries) would be beneficial as those tend
to be the same across multiple search requests.

[1] 
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf
, slide 13

m.

On Tue, Apr 24, 2012 at 4:43 PM, Erick Erickson erickerick...@gmail.com wrote:
 In general, query parsing is such a small fraction of the total time that,
 almost no matter how complex, it's not worth worrying about. To see
 this, attach debugQuery=on to your query and look at the timings
 in the pepare and process portions of the response. I'd  be
 very sure that it was a problem before spending any time trying to make
 the transmission of the data across the wire more efficient, my first
 reaction is that this is premature optimization.

 Second, you could do this on the server side with a custom query
 component if you chose. You can freely modify the query
 over there and it may make sense in your situation.

 Third, consider no cache filters, which were developed for
 expensive filter queries, ACL being one of them. See:
 https://issues.apache.org/jira/browse/SOLR-2429

 Fourth, I'd ask if there's a way to reduce the size of the FQ
 clause. Is this on a particular user basis or groups basis?
 If you can get this down to a few groups that would help. Although
 there's often some outlier who is member of thousands of
 groups :(.

 Best
 Erick


 2012/4/24 Mindaugas Žakšauskas min...@gmail.com:
 On Tue, Apr 24, 2012 at 3:27 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I'm about to try out a contribution for serializing queries in
 Javascript using Jackson. I've previously done this by serializing my
 own data structure and putting the JSON into a custom query parameter.

 Thanks for your reply. Appreciate your effort, but I'm not sure if I
 fully understand the gain.

 Having data in JSON would still require it to be converted into Lucene
 Query at the end which takes space  CPU effort, right? Or are you
 saying that having query serialized into a structured data blob (JSON
 in this case) makes it somehow easier to convert it into Lucene Query?

 I only thought about Java serialization because:
 - it's rather close to the in-object format
 - the mechanism is rather stable and is an established standard in Java/JVM
 - Lucene Queries seem to implement java.io.Serializable (haven't done
 a thorough check but looks good on the surface)
 - other conversions (e.g. using XStream) are either slow or require
 custom annotations. I personally don't see how Lucene/Solr would
 include them in their core classes.

 Anyway, it would still be interesting to hear if anyone could
 elaborate on query parsing complexity.

 m.
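
For illustration, a minimal round-trip sketch of the Serializable point
above, assuming Lucene 3.x (where Query subclasses such as TermQuery
implement java.io.Serializable); the field and term are made up:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryRoundTrip {
    public static void main(String[] args) throws Exception {
        Query q = new TermQuery(new Term("acl", "group1"));

        // Serialize the query to a byte array.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(q);
        System.out.println("serialized size: " + bos.size() + " bytes");

        // Deserialize and confirm the round trip preserves equality.
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        Query back = (Query) in.readObject();
        System.out.println("round-trip equals: " + q.equals(back));
    }
}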


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread prakash_ajp
Right now, the query is a very simple one, something like q=text. Basically,
it would return ['textview', 'textviewer', ..]

But the issue is, the 'textviewer' could be from a file that is out of
bounds for this user. So, ultimately I would like to include the userName in
the query. As mentioned earlier, userName is another field in the main
index.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3935765.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr replication failing with error: Master at: is not available. Index fetch failed

2012-04-24 Thread Rahul Warawdekar
Hi,

In the Solr wiki, for replication, the master url is defined as follows:

<str name="masterUrl">http://master_host:port/solr/corename/replication</str>

This URL does not contain "admin" in its path, whereas the master URL you
provided has an additional "admin" in it.
Not very sure if this might be the issue, but you can just try removing
"admin" and check whether replication works.






-- 
Thanks and Regards
Rahul A. Warawdekar


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread Jeevanandam Madanagopal
On Apr 24, 2012, at 9:37 PM, prakash_ajp wrote:

 Right now, the query is a very simple one, something like q=text. Basically,
 it would return ['textview', 'textviewer', ..]
   hmm, so you're using the default query field

 
 But the issue is, the 'textviewer' could be from a file that is out of
 bounds for this user. So, ultimately I would like to include the userName in
 the query. As mentioned earlier, userName is another field in the main
 index.
   and you'd like to filter the result set by the userName field value
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3935765.html
 Sent from the Solr - User mailing list archive at Nabble.com.

In this scenario the 'fq' parameter will help you achieve your desired result.
Please refer to http://wiki.apache.org/solr/CommonQueryParameters#fq

Try this: q=text&fq=userName:prakash
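
For illustration, the same query from SolrJ, as a minimal sketch assuming
SolrJ 3.6 (the server URL and user name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilteredSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // q matches the text; fq restricts results to the user's documents
        // without affecting scoring, and the filter is cached separately.
        SolrQuery query = new SolrQuery("text");
        query.addFilterQuery("userName:prakash");

        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}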

Let us know!

-Jeevanandam



Re: JDBC import yields no data

2012-04-24 Thread Hasan Diwan
On 24 April 2012 07:49, Dyer, James james.d...@ingrambook.com wrote:

 You might also want to show us your dataimport handler configuration
 from solrconfig.xml and also the url you're using to start the data import.
  When it's complete, browsing to
 "http://192.168.1.6:8995/solr/db/dataimport" (or whatever the DIH handler
 name is in your config) should say "indexing complete" and also show the number
 of documents it imported.  Also, if you have commit=false in your config,
 it won't issue a commit, so you won't see the documents.


solrconfig.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<config>

  <luceneMatchVersion>LUCENE_35</luceneMatchVersion>

  <jmx />

  <!-- Set this to 'false' if you want solr to continue working after it has
       encountered an severe configuration error.  In a production environment,
       you may want solr to keep working even if one handler is mis-configured.

       You may also set this to false using by setting the system property:
         -Dsolr.abortOnConfigurationError=false
  -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <lib dir="../../../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless
         overridden. -->
    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!--
      If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will
      flush based on whichever limit is hit first.
    -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <!-- Tell Lucene when to flush documents to disk.
         Giving Lucene more memory for indexing means faster indexing at the
         cost of more RAM.

         If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will
         flush based on whichever limit is hit first.
    -->
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>

    <!--
     Expert:
     The Merge Policy in Lucene controls how merging is handled by Lucene.
     The default in 2.3 is the LogByteSizeMergePolicy, previous
     versions used LogDocMergePolicy.

     LogByteSizeMergePolicy chooses segments to merge based on their size.
     The Lucene 2.2 default, LogDocMergePolicy, chose when
     to merge based on number of documents.

     Other implementations of MergePolicy must have a no-argument constructor
    -->
    <!--<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>-->

    <!--
     Expert:
     The Merge Scheduler in Lucene controls how merges are performed.  The
     ConcurrentMergeScheduler (Lucene 2.3 default) can perform merges in the
     background using separate threads.  The SerialMergeScheduler (Lucene 2.2
     default) does not.
    -->
    <!--<mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>-->

    <!--
      As long as Solr is the only process modifying your index, it is
      safe to use Lucene's in process locking mechanism.  But you may
      specify one of the other Lucene LockFactory implementations in
      the event that you have a custom situation.

      none = NoLockFactory (typically only used with read only indexes)
      single = SingleInstanceLockFactory (suggested)
      native = NativeFSLockFactory
      simple = SimpleFSLockFactory

      ('simple' is the default for backwards compatibility with Solr 1.2)
    -->
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <!-- Deprecated -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>

    <!-- If true, unlock any held write or commit locks on startup.
         This defeats the locking mechanism that allows multiple
         processes to safely access a lucene index, and should be
         used with care.
         This is not needed if lock type is 'none' or 'single'
    -->
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <!-- the default high-performance update handler -->
  <updateHandler 

Re: JDBC import yields no data

2012-04-24 Thread Gora Mohanty
On 24 April 2012 22:22, Hasan Diwan hasan.di...@gmail.com wrote:
[...]
 The dataimport url I'm using is
 http://192.168.1.6:8995/solr/db/dataimport?command=full-import

And, does it show you any output? As James mentions, it should
say "busy" while the data import is running, and "indexing completed"
when done. Also, is the above URL correct? /solr/db/ looks a little
odd, but that could have to do with how you have Solr set up.

My other guess would be that your JDBC set up is not correct.
For testing, you could try to simplify it by not using
net.sf.log4jdbc.DriverSpy , and trying directly with the H2
database JDBC driver.
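
For instance, a minimal standalone check along these lines; the JDBC URL,
credentials, and table name below are placeholders, so substitute the
values from your data-config.xml:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcSanityCheck {
    public static void main(String[] args) throws Exception {
        // Load the plain H2 driver instead of the log4jdbc wrapper.
        Class.forName("org.h2.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:h2:tcp://localhost/~/test", "sa", "");
        Statement st = conn.createStatement();
        // Run the same SELECT as the DIH entity to see what the driver returns.
        ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM my_table");
        if (rs.next()) {
            System.out.println("rows visible to the driver: " + rs.getLong(1));
        }
        conn.close();
    }
}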

Regards,
Gora


RE: JDBC import yields no data

2012-04-24 Thread Dyer, James
After you issue the full-import command with the url you gave:

http://192.168.1.6:8995/solr/db/dataimport?command=full-import

Paste the URL into a web browser without the command:

http://192.168.1.6:8995/solr/db/dataimport

It should be giving you status as to how many database calls it has made, how many 
rows read & documents indexed.  Keep refreshing the page until it is done.  
When it finishes, you should get either a Success or a Failure message.  Is it 
saying success or failure?  Also, how many documents does it say it indexed?

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311



Re: Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Erick Erickson
If you're assembling an fq clause, this is all done or you, although
you need to take some care to form the fq clause _exactly_
the same way each time. Think of the filterCache as a key/value
map where the key is the raw fq text and the value is the docs
satisfying that query.

So fq=acl:(a OR a) will not, for instance, match
 fq=acl:(b OR a)
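
For illustration, a minimal sketch of building the fq text canonically, so
that logically equal ACL sets always produce the same cache key (the field
name and group values are made up):

import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class AclFq {

    // TreeSet sorts and de-duplicates the group names, so any ordering of
    // the same ACL groups yields byte-identical fq text and therefore a
    // single filterCache entry.
    static String aclFq(List<String> groups) {
        StringBuilder sb = new StringBuilder("acl:(");
        String sep = "";
        for (String g : new TreeSet<String>(groups)) {
            sb.append(sep).append(g);
            sep = " OR ";
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Both calls print acl:(a OR b), i.e. the same cache key.
        System.out.println(aclFq(Arrays.asList("b", "a")));
        System.out.println(aclFq(Arrays.asList("a", "b", "a")));
    }
}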

FWIW
Erick



Re: solr replication failing with error: Master at: is not available. Index fetch failed

2012-04-24 Thread geeky2
that was it!

thank you.

i did notice something else in the logs now ...

what is the meaning or implication of the message "Connection reset"?



2012-04-24 12:59:19,996 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 12:59:39,998 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
*2012-04-24 12:59:59,997 SEVERE [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Master at:
http://bogus:bogusport/somepath/somecore/replication/ is not available.
Index fetch failed. Exception: Connection reset*
2012-04-24 13:00:19,998 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:00:40,004 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:00:59,992 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:01:19,993 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:01:39,992 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:01:59,989 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:02:19,990 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:02:39,989 INFO  [org.apache.solr.handler.SnapPuller]
(pool-12-thread-1) Slave in sync with master.
2012-04-24 13:02:59,991 INFO  [org.a

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-replication-failing-with-error-Master-at-is-not-available-Index-fetch-failed-tp3932921p3936107.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread Klostermeyer, Michael
I'm new to Solr, but I would think the fq=[username] would work here.

http://wiki.apache.org/solr/CommonQueryParameters#fq

Mike



Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread prakash_ajp
I read on a couple of other web pages that fq is not supported for suggester.
I even tried the query and it doesn't help. My understanding was, when the
suggest (spellcheck) index is built, only the field chosen is considered for
queries and the other fields from the main index are not available for
filtering purposes once the index is created.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3936144.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread Jeevanandam Madanagopal
Yes, only the field the spellcheck index was built from is used for the suggest
query. I believe filtering documents in the search handler using the fq parameter
and spell suggestions are two separate parts we are discussing here.

Let's say you have a field for spellcheck, used to build the spell dictionary:

<field name="spell" type="textSpell" .... ... />

Use copyField to populate the spell field and get the dictionary created.

Then refer to the spellcheck component in the default search handler's
'last-components' section, like below:
 <arr name="last-components">
   <str>spellcheck</str>
 </arr>

Then you will be able to apply document filtering and spellcheck params to the
search handler while querying.

Detailed info: http://wiki.apache.org/solr/SpellCheckComponent [probably you
might have already been through it :) ]
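
For illustration, a minimal SolrJ sketch combining the two parts in one
request, assuming the setup above (field and user names are made up):

import org.apache.solr.client.solrj.SolrQuery;

public class SuggestWithFilter {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("text");
        // Document filtering handled by the search handler...
        q.addFilterQuery("userName:prakash");
        // ...and spellcheck params handled by the last-components spellcheck.
        q.set("spellcheck", true);
        q.set("spellcheck.collate", true);
        return q;
    }
}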

-Jeevanandam





Field names w/ leading digits cause strange behavior

2012-04-24 Thread bleakley
When specifying a field name that starts with a digit (or digits) in the fl
parameter, solr returns both the field name and the field value as those
digits. For example, using the nightly build
apache-solr-4.0-2012-04-24_08-27-47, I run:

java -jar start.jar
and
java -jar post.jar solr.xml monitor.xml

If I then add a field to the field list that starts with a digit
( localhost:8983/solr/select?q=*:*&fl=24 ) the results look like:
...
<doc>
  <long name="24">24</long>
</doc>
...

if I try fl=24_7 it looks like everything after the underscore is truncated:
...
<doc>
  <long name="24">24</long>
</doc>
...

and if I try fl=3test it looks like everything after the last digit is
truncated:
...
<doc>
  <long name="3">3</long>
</doc>
...

If I have an actual value for that field (say I've indexed "24_7" to be "true")
I get back that value as well as the behavior above:
...
<doc>
  <bool name="24_7">true</bool>
  <long name="24">24</long>
</doc>
...

Is it ok to have fields that start with digits? If so, is there a different
way to specify them using the fl parameter? Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Field-names-w-leading-digits-cause-strange-behavior-tp3936354p3936354.html
Sent from the Solr - User mailing list archive at Nabble.com.


embedded solr populating field of type LatLonType

2012-04-24 Thread Jason Cunning
Hi,

I have a question concerning the spatial field type LatLonType and populating 
it via an embedded solr server in java.

So far I've only ever had to index simple types like boolean, float, and 
string. This is the first complex type. So I'd like to use the following field 
definition for example in my schema:

<field name="coordinate" type="LatLonType" indexed="true" stored="false" multiValued="false"/>

And then I'd like to populate this field in java as in the following
pseudocode:

public SolrInputDocument populate(AppropriateJavaType coordinate) {

    SolrInputField inputField = new SolrInputField("coordinate");
    inputField.addValue(coordinate, 1.0F);

    SolrInputDocument inputDocument = new SolrInputDocument();
    inputDocument.put("coordinate", inputField);

    return inputDocument;
}

My question is, what is the AppropriateJavaType for populating a solr field of 
type LatLonType?
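
For what it's worth, a common approach is to pass the field value in
LatLonType's external "lat,lon" string form; a minimal, unverified sketch
with an illustrative field name:

import org.apache.solr.common.SolrInputDocument;

public class CoordinateDoc {
    public static SolrInputDocument populate(double lat, double lon) {
        SolrInputDocument doc = new SolrInputDocument();
        // LatLonType is typically fed a single "lat,lon" string;
        // Solr derives its internal lat/lon sub-fields from it.
        doc.addField("coordinate", lat + "," + lon);
        return doc;
    }
}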

Thank you for your time.

Re: correct location in chain for EdgeNGramFilterFactory ?

2012-04-24 Thread Erick Erickson
Well, what effect do you _want_?

I'd probably put it after the PorterStemFilterFactory. As it is, it'll
form a bunch of ngrams, then WordDelimiterFilterFactory will
try to break them up according to _its_ rules and eventually
you'll be sending absolute gibberish to the stemmer. I mean
what is the stemmer going to think of (starting out with "running")
ru, run, runn, runni, runnin, running?

I suggest you spend some time with admin/analysis with various
orderings to understand better how all the parts interact.

Best
Erick

On Tue, Apr 24, 2012 at 11:20 AM, geeky2 gee...@hotmail.com wrote:
 hello all,

 i want to experiment with the EdgeNGramFilterFactory at index time.

 i believe this needs to go in post tokenization - but i am doing a pattern
 replace as well as other things.

 should the EdgeNGramFilterFactory go in right after the pattern replace?




 <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>

     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement="" replace="all"/>

 *put EdgeNGramFilterFactory here <=== ?*

     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement="" replace="all"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>

 thanks for any help,



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/correct-location-in-chain-for-EdgeNGramFilterFactory-tp3935589p3935589.html
 Sent from the Solr - User mailing list archive at Nabble.com.


faceted searches - design question - facet field not part of qf search fields

2012-04-24 Thread geeky2


hello all,

this is more of a design / newbie question on how others combine faceted
search fields into their requestHandlers.

say you have a request handler set up like below.

does it make sense (from a design perspective) to add a faceted search field
that is NOT part of the main search fields (itemNo, productType, brand) in
the qf param?

for example, augment the requestHandler below to include a faceted search on
itemDesc?

would this be confusing ? - to be searching across three fields - but
offering faceted suggestions on itemDesc?

just trying to understand how others approach this

thanks

  <requestHandler name="generalSearch" class="solr.SearchHandler" default="false">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="echoParams">all</str>
      <int name="rows">10</int>
      <str name="qf">itemNo^1.0 productType^.8 brand^.5</str>
      <str name="q.alt">*:*</str>
    </lst>
    <lst name="appends">
    </lst>
    <lst name="invariants">
      <str name="facet">false</str>
    </lst>
  </requestHandler>



  


--
View this message in context: 
http://lucene.472066.n3.nabble.com/faceted-searches-design-question-facet-field-not-part-of-qf-search-fields-tp3936509p3936509.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread Erick Erickson
I don't know if there is a really good solution here. The problem is that
suggester (and the trunk FST version) simply traverse the terms in
the index. there's not even a real concept of those terms belonging to
any document. Since your security level is on a document basis, that
makes things hard.

How many users do you have? And do you ever expect to search
across more than one user's files? If not, you could consider having
one core per user. Then the suggestions would be correct and since
the searches would be against the user's core, they'd never see
any documents they didn't own.

But that solution has some complexity involved, and if you have a zillion
users it can be difficult to get right.

You could consider having separate (dynamically-defined) fields that
had the suggestion list for each individual user. That would be
administratively easier. Then your suggestions would simply go against
that user's suggestion field (suggestion_user1 e.g.).
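
For illustration, a sketch of that idea at index time, assuming a
dynamicField pattern such as suggestion_* exists in the schema (all names
here are made up):

import org.apache.solr.common.SolrInputDocument;

public class PerUserSuggest {
    public static SolrInputDocument doc(String id, String owner, String text) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text", text);
        // Copy the suggestable text into the owner's private field;
        // suggestions are then built against suggestion_<owner> only.
        doc.addField("suggestion_" + owner, text);
        return doc;
    }
}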

None of this is elegant, but this is not an elegant problem given how
Solr is structured.

Best
Erick



Re: faceted searches - design question - facet field not part of qf search fields

2012-04-24 Thread Erick Erickson
No problem here at all, it's done all the time. Consider a popular
facet series in the last day, in the last week, in the last month...
There's no reason you have to facet on the fields that are
searched on.

The user has search terms like "my dog has fleas" and your query
looks like
q=my dog has fleas&fq=timestamp:[NOW/DAY TO NOW/DAY+1DAY]
and the user sees all documents with those terms added since midnight
last night. No confusion at all...
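
In SolrJ terms, a minimal sketch of the same request:

import org.apache.solr.client.solrj.SolrQuery;

public class RecentDocsQuery {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("my dog has fleas");
        // The facet constraint goes into fq, not into the q string.
        q.addFilterQuery("timestamp:[NOW/DAY TO NOW/DAY+1DAY]");
        return q;
    }
}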

Best
Erick




Re: Field names w/ leading digits cause strange behavior

2012-04-24 Thread Erick Erickson
Hmmm, this does NOT happen on 3.6, and it DOES happen on
trunk. Sure sounds like a JIRA to me, would you mind raising one?

I can't imagine this is desired behavior, it's just weird.

Thanks for pointing this out!
Erick



Re: faceted searches - design question - facet field not part of qf search fields

2012-04-24 Thread Chris Hostetter
: 
: The user as search terms like my dog has fleas and your query
: looks like
: q=my dog has fleasfq=timestamp:[NOW/DAY TO NOW/DAY+1DAY]
: and the user sees all documents with those terms added since midnight
: last night. No confusion at all...

right ... whether the facets are useful or confusing has nothing to do with 
whether the fields are in your qf ... what matters is what you *do* with 
those facet counts once you have them.

if you offer the user the ability to filter on a constraint (which is what 
most people do with facet info) then as long as you generate that filter 
using the same field, as an fq, then everything should make sense.

if instead you just try to add the constraint to your main q query 
string, as an additional clause, then that is likely to make no sense at 
all, since the terms from your facet field may not have any bearing on the 
fields you are querying against.


-Hoss


Re: Field names w/ leading digits cause strange behavior

2012-04-24 Thread bleakley
Thank you for verifying the issue. I've created a ticket at
https://issues.apache.org/jira/browse/SOLR-3407

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Field-names-w-leading-digits-cause-strange-behavior-tp3936354p3936599.html
Sent from the Solr - User mailing list archive at Nabble.com.


Title Boosting and IDF

2012-04-24 Thread Tavi Nathanson
Hey everyone,

I index documents with title and body fields. The title field often has far fewer
terms than the body field. IDF, as a result, will have a profound effect in
the title field compared to the body field.

I currently have the title field boosted by 4x relative to the body field.
While I want matches in the title field to result in higher scores than
matches in the body field, I don't believe I want the title to completely
trump the body. I've seen this happen when a rare term is present in the
title field, and IDF combines with the 4x boost to wreak havoc.

I'd like to get your thoughts on the following:

- Is it standard practice to avoid boosting the title field much, because of
the (generally) high IDF of title field terms?
- Are there other strategies for handling the high IDF of a title field?

Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Title-Boosting-and-IDF-tp3936709p3936709.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread Doug Mittendorf
Another option is to use faceting (via the facet.prefix param) for your 
auto-suggest.  It's not as fast and scalable as using one of the 
Suggester implementations, but it does allow arbitrary fq parameters to 
be included in the request to limit the results.


http://wiki.apache.org/solr/SimpleFacetParameters#Facet_prefix_.28term_suggest.29
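
For illustration, a sketch of such a request in SolrJ; the field and user
names are placeholders:

import org.apache.solr.client.solrj.SolrQuery;

public class PrefixSuggest {
    public static SolrQuery build(String prefix, String userName) {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);                              // suggestions only, no documents
        q.addFilterQuery("userName:" + userName);  // arbitrary fq works here
        q.setFacet(true);
        q.addFacetField("text");
        q.setFacetPrefix(prefix);                  // terms starting with the prefix
        q.setFacetLimit(10);
        return q;
    }
}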

Doug





QueryElevationComponent and distributed search

2012-04-24 Thread srinir
Hi,

I am using solr 3.6. I saw in Solr wiki that QueryElevationComponent is not
supported for distributed search. 

https://issues.apache.org/jira/browse/SOLR-2949

When I checked the above ticket, it looks like it's fixed in Solr 4.0. Does
anyone have any idea when a stable version of Solr 4.0 will be released
(approximate time frame)? If not, are these changes independent of other Solr 4.0
changes, so that I can just apply this patch to my setup for now? I would like to
use Solr 3.6 because I would like to use a stable version in production.


Thanks
Srini

--
View this message in context: 
http://lucene.472066.n3.nabble.com/QueryElevationComponent-and-distributed-search-tp3936998p3936998.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread prakash_ajp
The first one may not work because the number of users can be big. Besides,
the users can simply register themselves and start using it. It won't work
if an admin has to intervene in the registration process.

The second could work, I guess. But the problem would be data duplication, as
users might also share permissions to the same files and folders. I understand
my requirement is a little complicated.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3937368.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto suggest on indexed file content filtered based on user

2012-04-24 Thread prakash_ajp
Is it true that faceting is case sensitive? That would be disastrous for our
requirement :(

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-suggest-on-indexed-file-content-filtered-based-on-user-tp3934565p3937370.html
Sent from the Solr - User mailing list archive at Nabble.com.