Re: How to re-index Solr & get term frequency within documents

2013-07-02 Thread Tony Mullins
I use Nutch as input datasource for my Solr.
So I cannot re-run all the Nutch jobs to generate data again for Solr as it
will take very long to generate that much data.

I was hoping there would be an easier way inside Solr to just re-index all
the existing data.

Thanks,
Tony


On Tue, Jul 2, 2013 at 1:37 AM, Jack Krupansky j...@basetechnology.com wrote:

 Or, go with a commercial product that has a single-click Solr re-index
 capability, such as:

 1. DataStax Enterprise - data is stored in Cassandra and reindexed into
 Solr from there.

 2. LucidWorks Search - data sources are declared so that the package can
 automatically re-crawl the data sources.

 But, yeah, as Otis says, re-index is really just a euphemism for
 deleting your Solr data directory and indexing from scratch from the
 original data sources.

 -- Jack Krupansky

 -Original Message- From: Otis Gospodnetic
 Sent: Monday, July 01, 2013 2:26 PM
 To: solr-user@lucene.apache.org
 Subject: Re: How to re-index Solr & get term frequency within documents


 If all your fields are stored, you can do it with
 http://search-lucene.com/?q=solrentityprocessor

 Otherwise, just reindex the same way you indexed in the first place.
 *Always* be ready to reindex from scratch.

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
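
A minimal data-config.xml sketch of the SolrEntityProcessor approach Otis points to, assuming all fields are stored (the core URL, query and rows are illustrative placeholders):

<dataConfig>
  <document>
    <!-- Read every stored document from the existing core and
         re-feed it through this core's (updated) schema. -->
    <entity name="reindex"
            processor="SolrEntityProcessor"
            url="http://localhost:8080/solr/oldcore"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>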



 On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins tonymullins...@gmail.com
 wrote:

 Thanks Jack , it worked.

 Could you please provide some info on how to re-index existing data in
 Solr, after changing the schema.xml ?

 Thanks,
 Tony


On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky j...@basetechnology.com wrote:

  You can write any function query in the field list of the fl parameter.
 Sounds like you want termfreq:

 termfreq(field_arg,term)

 fl=id,a,b,c,termfreq(a,xyz)


 -- Jack Krupansky
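
Put together as a full request against the field from Tony's original question below, this might look like the following (a sketch; host and core are from the thread, and the quoted term follows the function-query string-literal syntax):

http://localhost:8080/solr/select?q=iphone&df=CommentX&fl=AuthorX,TitleX,CommentX,freq:termfreq(CommentX,'iphone')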

 -Original Message- From: Tony Mullins
 Sent: Monday, July 01, 2013 10:47 AM
 To: solr-user@lucene.apache.org
 Subject: How to re-index Solr & get term frequency within documents


 Hi,

 I am using Solr 4.3.0.
 If I change my Solr schema.xml, do I need to re-index my Solr? And
 if yes, how?

 My 2nd question: I need to find the frequency of a term per document in
 all documents of the search result.

 My field is

 <field name="CommentX" type="text_general" stored="true" indexed="true"
 multiValued="true" termVectors="true" termPositions="true"
 termOffsets="true"/>

 And I am trying this query

 http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=true

 It's just returning me the result set, with no info on my searched term's
 (iphone) frequency in each document.

 How can I make Solr return the frequency of the searched term per document
 in the result set?

 Thanks,
 Tony.





Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Can you please suggest a way (with an example) of assigning this unique key to a
pdf file?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074588.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Okay. Can you please suggest a way (with an example) of assigning this unique
key to a pdf file. Say, a unique number to each pdf file. How do I achieve
this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr indexer and Hadoop

2013-07-02 Thread engy.morsy
Michael, 

I understand from your post that I can use the current storage without moving
it into Hadoop. I already have the storage mounted via NFS.
Does your map function read from the mounted storage directly? If possible,
can you please elaborate on that.

Thanks
Engy



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4074604.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr - Delta Query Via Full Import

2013-07-02 Thread Mysurf Mail
I am using DIH to fetch rows from a DB into Solr.
I have many 1:n relations, and I can do it only if I use caching (super
fast). Therefore I am adding the following attributes to my inner entity:

processor="CachedSqlEntityProcessor" cacheKey="..." cacheLookup="..."

Everything works great and fast. (First the n tables are queried, then the
main entity.)

Now I want to configure the delta import, and it is not actually working.

I know that by the standard approach
(http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example)
I need to define the following attributes:

   1. query - Initial Query
   2. DeltaQuery - The rows that were changed
   3. DeltaImportQuery - Fetch the data that was changed
   4. parentDeltaQuery - The Keys of the parent entity that has changed
   rows in the current entity

(2-4 only used in delta queries)

And I have seen a hack in the documentation
(http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example)
that lets you do a delta query via a full import.
So instead of adding all three attributes -
query, deltaImportQuery, deltaQuery - I can just add query and call full-import
instead of delta-import, as shown below.
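
A sketch of that command, assuming the handler is registered at /dataimport as in the wiki example (host and port are placeholders):

http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true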

Problem - only the first query (the main entity) is executed when I run the
full import without clean.

Here is a part of my configuration in data-config.xml (I have left
deltaImportQuery in, though I call only full-import):

<entity name="PackageVersion" pk="PackageVersionId"
        query="select ...
               from [dbo].[Package] Package inner join
               [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               Where '${dataimporter.request.clean}' != 'false'
               OR Package.LastModificationTime > '${dataimporter.last_index_time}'
               OR PackageVersion.Timestamp > '${dataimporter.last_index_time}'"
        deltaImportQuery="select ...
               from [dbo].[Package] Package inner join
               [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               Where '${dataimporter.request.clean}' != 'false'
               OR Package.LastModificationTime > '${dataimporter.last_index_time}'
               OR PackageVersion.Timestamp > '${dataimporter.last_index_time}' and
               ID=='${dih.delta.id}'">
    <entity name="PackageTag" pk="ResourceId"
            processor="CachedSqlEntityProcessor" cacheKey="ResourceId"
            cacheLookup="PackageVersion.PackageId"
            query="SELECT ResourceId,[Text] PackageTag
                   from [dbo].[Tag] Tag
                   Where '${dataimporter.request.clean}' = 'true'
                   OR Tag.TimeStamp > '${dataimporter.last_index_time}'"
            parentDeltaQuery="select PackageVersion.PackageVersionId
                   from [dbo].[Package] Package
                   inner join [dbo].[PackageVersion] PackageVersion
                   ON Package.Id = PackageVersion.PackageId
                   where Package.Id=${PackageTag.ResourceId}"/>
</entity>


Re: Unique key error while indexing pdf files

2013-07-02 Thread Shalin Shekhar Mangar
We can't tell you what the id of your own document should be. Isn't
there anything which is unique about your pdf files? How about the
file name or the absolute path?

On Tue, Jul 2, 2013 at 11:33 AM, archit2112 archit2...@gmail.com wrote:
 Okay. Can you please suggest a way (with an example) of assigning this unique
 key to a pdf file. Say, a unique number to each pdf file. How do I achieve
 this?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074592.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Yes. The absolute path is unique.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074620.html
Sent from the Solr - User mailing list archive at Nabble.com.


Removal of unique key - Query Elevation Component

2013-07-02 Thread archit2112

I want to index pdf files in solr 4.3.0 using the data import handler.

I have done the following:

My request handler -

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

My data-config.xml

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="C:\Users\aroraarc\Desktop\Impdo" fileName=".*pdf"
            recursive="true">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Now when I tried to index the documents I got the following error:

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id

Because I don't want any uniqueKey in my case, I disabled it as follows:

In solrconfig.xml I commented out -

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

In schema.xml I commented out <uniqueKey>id</uniqueKey>

and added

<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<field name="id" type="uuid" indexed="true" stored="true" default="NEW" />

and in elevate.xml i made the following changes

<elevate>
  <query text="foo bar">
    <doc id="4602376f-9741-407b-896e-645ec3ead457" />
  </query>
</elevate>

When I do this the indexing takes place, but the indexed docs contain only
author, author_s and id fields. The document should contain author, text, title
and id fields (as defined in my data-config.xml). Please help me out. Am I
doing anything wrong? And where did this author_s field come from?

<doc>
  <str name="author">arora arc</str>
  <str name="author_s">arora arc</str>
  <str name="id">4f65332d-49d9-497a-b88b-881da618f571</str>
</doc>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Removal-of-unique-key-Query-Elevation-Component-tp4074624.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Removal of unique key - Query Elevation Component

2013-07-02 Thread Shalin Shekhar Mangar
My guess is that you have a copyField element which copies the
author into an author_s field.
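
The kind of schema.xml lines that would produce this (a sketch, assuming the stock *_s dynamic string field):

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<copyField source="author" dest="author_s"/>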

On Tue, Jul 2, 2013 at 2:14 PM, archit2112 archit2...@gmail.com wrote:

 I want to index pdf files in solr 4.3.0 using the data import handler.

 I have done the following:

 My request handler -

 <requestHandler name="/dataimport"
     class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="config">data-config.xml</str>
   </lst>
 </requestHandler>

 My data-config.xml

 <dataConfig>
   <dataSource type="BinFileDataSource" />
   <document>
     <entity name="f" dataSource="null" rootEntity="false"
             processor="FileListEntityProcessor"
             baseDir="C:\Users\aroraarc\Desktop\Impdo" fileName=".*pdf"
             recursive="true">
       <entity name="tika-test" processor="TikaEntityProcessor"
               url="${f.fileAbsolutePath}" format="text">
         <field column="Author" name="author" meta="true"/>
         <field column="title" name="title" meta="true"/>
         <field column="text" name="text"/>
       </entity>
     </entity>
   </document>
 </dataConfig>

 Now when I tried to index the documents I got the following error:

 org.apache.solr.common.SolrException: Document is missing mandatory
 uniqueKey field: id

 Because I don't want any uniqueKey in my case, I disabled it as follows:

 In solrconfig.xml I commented out -

 <searchComponent name="elevator" class="solr.QueryElevationComponent">
   <!-- pick a fieldType to analyze queries -->
   <str name="queryFieldType">string</str>
   <str name="config-file">elevate.xml</str>
 </searchComponent>

 In schema.xml I commented out <uniqueKey>id</uniqueKey>

 and added

 <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
 <field name="id" type="uuid" indexed="true" stored="true" default="NEW" />

 and in elevate.xml i made the following changes

 <elevate>
   <query text="foo bar">
     <doc id="4602376f-9741-407b-896e-645ec3ead457" />
   </query>
 </elevate>

 When I do this the indexing takes place, but the indexed docs contain only
 author, author_s and id fields. The document should contain author, text, title
 and id fields (as defined in my data-config.xml). Please help me out. Am I
 doing anything wrong? And where did this author_s field come from?

 <doc>
   <str name="author">arora arc</str>
   <str name="author_s">arora arc</str>
   <str name="id">4f65332d-49d9-497a-b88b-881da618f571</str>
 </doc>





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Removal-of-unique-key-Query-Elevation-Component-tp4074624.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr indexer and Hadoop

2013-07-02 Thread Anatoli Matuskova
If you can upload your data to hdfs you can use this patch to build the solr
indexes:
https://issues.apache.org/jira/browse/SOLR-1301



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4074635.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Removal of unique key - Query Elevation Component

2013-07-02 Thread archit2112
Thanks! The author_s issue has been resolved. 
Why are the other fields not getting indexed?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Removal-of-unique-key-Query-Elevation-Component-tp4074624p4074636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-02 Thread archit2112
Yes. The absolute path is unique. How do I implement it? Can you please
explain?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html
Sent from the Solr - User mailing list archive at Nabble.com.
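
One hedged way to wire it into the DIH config from the "Removal of unique key" thread below: FileListEntityProcessor exposes an implicit fileAbsolutePath column, and a TemplateTransformer can copy it into the id field (a sketch, assuming the schema's uniqueKey is id; other fields are elided):

<entity name="tika-test" processor="TikaEntityProcessor"
        url="${f.fileAbsolutePath}" format="text"
        transformer="TemplateTransformer">
  <!-- Use the unique absolute path as the document's uniqueKey. -->
  <field column="id" template="${f.fileAbsolutePath}"/>
  ...
</entity>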


need distance in miles not in kilometers

2013-07-02 Thread irshad siddiqui
Hi,


I am using Solr 4.2 and my results are coming through properly.

But now I want the distance in miles, and I am getting the distance
in kilometres.

Can anyone tell me how to get the distance in miles.

example query

q=*:*&fq={!geofilt}&sfield=latlng&pt=18.9322453,72.8264378001&d=60&fl=_dist_:geodist()&sort=geodist() desc


url
http://wiki.apache.org/solr/SpatialSearch


Thanks in advance.

Regards,
Irshad


Re: OOM killer script woes

2013-07-02 Thread Daniel Collins
On looking at the code in SolrDispatchFilter, is this intentional or not?
 I think I remember Mark Miller mentioning that in an OOM case, the best
course of action is basically to kill the process; there is very little
Solr can do once it has run out of memory.  Yet it seems that Solr catches
the OOM itself and just logs it as an error, rather than letting it
propagate back up to the JVM.

We have also seen OOMs in IndexWriter, which has specific code to handle
OOM cases and seems to fall back to the transaction log (but fails to
commit anything).  I understand the logic of that, but in reality I've
seen the tlog get corrupted in this case, so we still need to be
monitoring the system and forcibly kill the process.



On 27 June 2013 00:03, Timothy Potter thelabd...@gmail.com wrote:

 Thanks for the feedback Daniel ... For now, I've opted to just kill
 the JVM with System.exit(1) in the SolrDispatchFilter code and will
 restart it with a Linux supervisor. Not elegant but the alternative of
 having a zombie Solr instance walking around my cluster is much worse
 ;-) Will try to dig into the code that is trapping this error but for
 now I've lost too many hours on this problem.

 Cheers,
 Tim
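
A minimal sketch of the kind of guard Tim describes wiring into SolrDispatchFilter (a hypothetical helper, not the actual patch; halt() is used because little else is reliable once the heap is exhausted):

public final class OomGuard {
    private OomGuard() {}

    /** Walk the cause chain; if an OutOfMemoryError is found, kill the JVM
     *  so an external supervisor can restart the process. */
    public static void haltIfOutOfMemory(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof OutOfMemoryError) {
                // halt() skips shutdown hooks; with no heap left, even a
                // normal System.exit(1) may fail to complete.
                Runtime.getRuntime().halt(1);
            }
        }
    }
}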

 On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins danwcoll...@gmail.com
 wrote:
  Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and
  throwing it/packaging it as a java.lang.RuntimeException.  The -XX option
  assumes that the application doesn't handle the Errors and so they would
  reach the JVM and thus invoke the handler.
  Since Jetty has an exception handler that is dealing with anything
  (including Errors), they never reach the JVM, hence no handler.
 
  Not much we can do short of not using Jetty?
 
  That's a pain, I'd just written a nice OOM handler too!
 
 
  On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote:
 
  A little more to this ...
 
  Just on the chance this was a weird Jetty issue or something, I tried with
  the latest 9 and the problem still occurs :-(
 
  This is on Java 7 on debian:
 
  java version 1.7.0_21
  Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
  Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
 
  Here is an example stack trace from the log
 
  2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.eclipse.jetty.server.Server.handle(Server.java:445)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
  at java.lang.Thread.run(Thread.java:722)
  Caused by: java.lang.OutOfMemoryError: Java heap space
 
  On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com
  wrote:
   Recently upgraded to 4.3.1 but this problem has persisted for a while
  now ...
  
   I'm using the following configuration when starting Jetty:
  
    -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
  
   If an OOM is triggered during Solr web app initialization (such as by
   me lowering -Xmx to a value that is too low to initialize Solr with),
   then the script gets called and does what I expect!
  
   However, once the Solr webapp initializes 

Aggregate TermFrequency on Result Grouping / Field Collapsing

2013-07-02 Thread Tony Mullins
Hi,

Is it possible to perform an aggregated termfreq(field,term) on Result
Grouping?

I am trying to get the total count of a term's appearances in a document and then
want to aggregate that count by grouping the documents on one of my fields.

Like this

http://localhost:8080/solr/collection1/select?q=iphone&wt=json&indent=true&group=true&group.field=url&fl=freq%3Atermfreq%28CommentX%2C%27iphone%27%29

The problem is it is returning only the top-level result (doc) in each group, and
thus the term frequency of that result (doc) only.

How can I make it sum the termfreq() of all the documents per group?

Thanks,
Tony


undefined field http:// while searching query

2013-07-02 Thread aniljayanti
Hi,

I am using Solr version 3.3. After indexing, I am querying with the below command:

http://localhost:8080/solr/select/?q=(http://www.google.co.in)

and getting the below error:

SEVERE: org.apache.solr.common.SolrException: undefined field http://
at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1254)
at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getAnalyzer(IndexSchema.java:410)
at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:385)
at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:574)
at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:158)
at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:81)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:257)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1764)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

Can you please assist me with this?

Thanks in advance.

Aniljayanti.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/undefined-field-http-while-searchi-query-tp4074601.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 4.3 Pivot Performance Issue

2013-07-02 Thread solrUserJM
Hi There,

I noticed with the upgrade from Solr 4.0 to Solr 4.3 that we had a
degradation of queries that use pivot fields. Has anyone else noticed
it too?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-3-Pivot-Performance-Issue-tp4074617.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: No date.gap on pivoted facets

2013-07-02 Thread Dotan Cohen
On Sun, Jun 30, 2013 at 5:33 PM, Jack Krupansky j...@basetechnology.com wrote:
 Sorry, but Solr pivot faceting is based solely on field facets, not
 range (or date) facets.


Thank you. I tried adding that information to the
SimpleFacetParameters wiki page, but that page seems to be defined as
Immutable Page.


 You can approximate date gaps by making a copy of your raw date field and
 then manually gap (truncate) the date values so that their discrete
 values correspond to your date gap.


Thank you, this is what I have done.


 In the next release of my book I have a script for a
 StatelessScriptUpdateProccessor (with examples) that supports truncation of
 dates to a desired resolution, copying or modifying the input date as
 desired.


Terrific, I anticipate the release. Next release? Did I miss the release?
http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957/

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Spell check in SOLR

2013-07-02 Thread Prathik Puthran
Hi,

How can I configure Solr to provide corrections for misspelled words? If
the query string is in the dictionary, Solr should not return any suggestions.
But if the query string is not in the dictionary, Solr should return all
possible corrected words in the dictionary which could most likely be the
query string.

Thanks,
Prathik


RE: undefined field http:// while searching query

2013-07-02 Thread Markus Jelsma
Colons need to be escaped.
Cheers
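
Applied to the query quoted below, that would be something like:

 q=(http\://www.google.co.in)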

 
 
-Original message-
 From:aniljayanti aniljaya...@yahoo.co.in
 Sent: Tuesday 2nd July 2013 12:35
 To: solr-user@lucene.apache.org
 Subject: undefined field http:// while searching query
 
 Hi,
 
 I am using Solr version 3.3. After indexing, I am querying with the below command:
 
 http://localhost:8080/solr/select/?q=(http://www.google.co.in)
 
 and getting the below error:
 
 SEVERE: org.apache.solr.common.SolrException: undefined field http://
   at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1254)
   at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getAnalyzer(IndexSchema.java:410)
   at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:385)
   at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:574)
   at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:158)
   at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
   at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
   at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
   at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
   at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
   at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
   at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
   at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
   at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
   at org.apache.solr.search.QParser.getQuery(QParser.java:142)
   at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:81)
   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
   at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
   at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:257)
   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
   at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1764)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)
 
 Can you please assist me with this?
 
 Thanks in advance.
 
 Aniljayanti.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/undefined-field-http-while-searchi-query-tp4074601.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


parent Import Query doesn't run

2013-07-02 Thread Mysurf Mail
I have a 1:n relation between my main entity (PackageVersion) and its tags in
my DB.

I add a new tag to the DB at the timestamp and I run the delta
import command.
The select retrieves the line, but I don't see any other SQL.
Here are my data-config.xml configurations:

<entity name="PackageVersion" pk="PackageVersionId"
        query="select ...
               from [dbo].[Package] Package inner join
               [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId"
        deltaQuery="select PackageVersion.Id PackageVersionId
               from [dbo].[Package] Package inner join
               [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               where Package.LastModificationTime > '${dataimporter.last_index_time}'
               OR PackageVersion.Timestamp > '${dataimporter.last_index_time}'"
        deltaImportQuery="select ...
               from [dbo].[Package] Package inner join
               [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               Where PackageVersionId=='${dih.delta.id}'">
    <entity name="PackageTag" pk="ResourceId"
            processor="CachedSqlEntityProcessor" cacheKey="ResourceId"
            cacheLookup="PackageVersion.PackageId"
            query="SELECT ResourceId,[Text] PackageTag
                   from [dbo].[Tag] Tag"
            deltaQuery="SELECT ResourceId,[Text] PackageTag
                   from [dbo].[Tag] Tag
                   Where Tag.TimeStamp > '${dataimporter.last_index_time}'"
            parentDeltaQuery="select PackageVersion.PackageVersionId
                   from [dbo].[Package]
                   where Package.Id=${PackageVersion.PackageVersionId}"/>
</entity>


Re: undefined field http:// while searching query

2013-07-02 Thread Daniel Collins
Presuming that uses the standard Lucene query parser syntax, then you have
asked to query the field called http, searching for the value
//www.google.co.in.
See http://wiki.apache.org/solr/SolrQuerySyntax for more details, but you
probably want to escape the : at least: http\://www.google.co.in



On 2 July 2013 07:34, aniljayanti aniljaya...@yahoo.co.in wrote:

 Hi,

 I am using Solr version 3.3. After indexing, I am querying with the below command:

 http://localhost:8080/solr/select/?q=(http://www.google.co.in)

 and getting the below error:

 SEVERE: org.apache.solr.common.SolrException: undefined field http://
  at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1254)
  at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getAnalyzer(IndexSchema.java:410)
  at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:385)
  at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:574)
  at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:158)
  at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
  at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
  at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
  at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
  at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
  at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
  at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
  at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
  at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
  at org.apache.solr.search.QParser.getQuery(QParser.java:142)
  at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:81)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
  at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
  at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:257)
  at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
  at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1764)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)

 Can you please assist me with this?

 Thanks in advance.

 Aniljayanti.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/undefined-field-http-while-searchi-query-tp4074601.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Stemming query in Solr

2013-07-02 Thread Erick Erickson
Somehow we're mis-communicating here. Forget expansion,
it's all about base forms. <G>.

bq: What I cannot figure out is how is this going to help me in instructing
Solr to execute the query for the different grammatical variations of the
input search term stem

You don't. You search the stemmed input against the stemmed
field (happens automatically by field).

So, getting hits on burn, burns, burned, burning when searching
for burning,  because both the query and index process are
working with burn. Note that the _stored_ values that get returned with
the fields are all the originals, so you see burns, burning, etc.

Your query searches against one or the other field depending
on whether you have the exact match checkbox checked or
not. You can even do a variant of searching on _both_ with
a high boost on the exact_match field which would _tend_ to
sort the documents with exact match to the top of the list.

Best
Erick
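
Using the field names from the schema quoted further down, that "both fields" variant might look like this (a sketch; the boost value is illustrative):

q=ContentSearch:burning^10 OR ContentSearchStemming:burning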


On Mon, Jul 1, 2013 at 9:12 AM, snkar soumya@zoho.com wrote:

 I was just wondering if another solution might work. If we are able to
 extract the stem of the input search term(maybe using a C# based stemmer,
 some open source implementation of the Porter algorithm) for cases where
 the stemming option is selected, and submit the query to solr as a multiple
 character wild card query with respect to the stem, it should return me all
 the different variations of the stemmed word.

 Example:

 Search Term: burning
 Stem: burn
 Modified Query: burn*
 Results: burn, burning, burns, burnt, etc.

 I am sure this is not the proper way of executing a stemming by expansion,
 but this might just get the job done. What do you think? Trying to think of
 a test case where this will fail.

 On Mon, 01 Jul 2013 03:42:34 -0700 Erick Erickson [via Lucene]
 <ml-node+s472066n4074311...@n3.nabble.com> wrote:


  bq:  But looks like it is executing the search for an exact text based
 match with the stem burn.

 Right. You need to appreciate index time as opposed to query time stemming.
 Your field
 definition has both turned on. The admin/analysis page will help here
 <G>..

 At index time, the terms are stemmed, and _only_ the reduced term is put in
 the index.
 At query time, the same thing happens and _only_ the reduced term is
 searched for.

 By stemming at index time, you lose the original form of the word, it's
 just gone and
 nothing about checking/unchecking the stem bits will recover it. So the
 general
 solution is to index the field twice, once with stemming and once without
 in order
 to have the ability to do both stemmed and exact matches. I think I saw a
 clever
 approach to doing this involving a custom filter but can't find it now. As
 I recall it
 indexed the un-stemmed version like a synonym with some kind of marker
 to indicate exact match when necessary

 Best
 Erick


 On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote:

 > Hi Erick,
 >
 > Thanks for the reply.
 >
 > Here is what the situation is:
 >
 > Relevant portion of Solr Schema:
 > <field name="Content" type="text_general" indexed="false" stored="true"
 > required="true"/>
 > <field name="ContentSearch" type="text_general" indexed="true"
 > stored="false" multiValued="true"/>
 > <field name="ContentSearchStemming" type="text_stem" indexed="true"
 > stored="false" multiValued="true"/>
 > <copyField source="Content" dest="ContentSearch"/>
 > <copyField source="Content" dest="ContentSearchStemming"/>
 >
 > <fieldType name="text_general" class="solr.TextField"
 > positionIncrementGap="100">
 >   <analyzer type="index">
 >     <tokenizer class="solr.StandardTokenizerFactory"/>
 >     <filter class="solr.StopFilterFactory" ignoreCase="true"
 >             words="stopwords.txt" enablePositionIncrements="true" />
 >     <filter class="solr.LowerCaseFilterFactory"/>
 >   </analyzer>
 >   <analyzer type="query">
 >     <tokenizer class="solr.StandardTokenizerFactory"/>
 >     <filter class="solr.StopFilterFactory" ignoreCase="true"
 >             words="stopwords.txt" enablePositionIncrements="true" />
 >     <filter class="solr.LowerCaseFilterFactory"/>
 >   </analyzer>
 > </fieldType>
 >
 > <fieldType name="text_stem" class="solr.TextField">
 >   <analyzer>
 >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 >     <filter class="solr.SnowballPorterFilterFactory"/>
 >   </analyzer>
 > </fieldType>
 >
 > When I am indexing a document, the content gets stored as is in the
 > Content field and gets copied over to ContentSearch and
 > ContentSearchStemming for text based search and stemming search
 > respectively. So, the ContentSearchStemming field does store the
 > stem/reduced form of the terms. I have checked this with Luke as well
 > as the Admin Schema Browser --> Term Info. In the Admin
 >

Re: documentCache not used in 4.3.1?

2013-07-02 Thread Erick Erickson
This takes some significant custom code, but...

One strategy is to keep your commits relatively
lengthy (depends on the ingest rate) and keep
a side car index, either a small core or a
RAMDirectory. Then at search time you somehow
combine the two results. The somehow is a
bit tricky since the scores may not be comparable.
If you're sorting it's trivial, but what you describe
doesn't sound like it's sorted as opposed to score.
Or more accurately, it sounds like you're sorting
by score.

But none of that is worthwhile if you're getting
good enough results as it stands.

Best
Erick
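
A minimal solrconfig.xml sketch of the soft/hard commit split discussed in the quoted thread below (intervals are illustrative: 1s soft commits for visibility, infrequent hard commits with openSearcher=false for durability):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>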


On Mon, Jul 1, 2013 at 12:28 PM, Daniel Collins danwcoll...@gmail.comwrote:

 Regrettably, visibility is key for us :(  Documents must be searchable as
 soon as they have been indexed (or as near as we can make it).  Our old
 search system didn't do relevance sort, it was time-ordered (so it had a
 much simpler job) but it did have sub-second latency, and that is what is
 expected for its replacement (I know Solr doesn't like <1s currently, but
 we live in hope!).  Tried explaining that by doing relevance sort we are
 searching 100% of the collection, instead of the ~10%-20% a time-ordered
 sort did (it effectively sharded by date and only searched as far back as
 it needed to fill a page of results), but that tends to get blank looks
 from business. :)

 One of life's little challenges.


 On 1 July 2013 11:10, Erick Erickson erickerick...@gmail.com wrote:

  Daniel:
 
  Soft commits invalidate the top level caches, which include
  things like filterCache, queryResultCache etc. Various
  segment-level caches are NOT invalidated, but you really
  don't have a lot of control from the Solr level over those
  anyway.
 
  But yeah, the tension between caching a bunch of stuff
  for query speedups and NRT is still with us. Soft commits
  are much less expensive than hard commits, but not being
  able to use the caches as much is the price. You're right
  that with such frequent autocommits, autowarming
  probably is not worth the effort.
 
  The question I always ask is whether 1 second is really
  necessary. Or, more accurately, worth the price. Often
  it's not and lengthening it out significantly may be an option,
  but that's a discussion for you to have with your product
   manager <G>
 
  I have seen configurations that have a more frequent hard
  commit (openSearcher=false) than soft commit. The
  mantra is soft commits are about visibility, hard commits
  are about durability.
 
  FWIW,
  Erick
 
 
  On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins danwcoll...@gmail.com
  wrote:
 
   We see similar results, again we softCommit every 1s (trying to get as
  NRT
   as we can), and we very rarely get any hits in our caches.  As an
   unscheduled test last week, we did shutdown indexing and noticed about
  80%
   hit rate in caches (and average query time dropped from ~1s to 100ms!)
  so I
   think we are in the same position as you.
  
   I appreciate with such a frequent soft commit that the caches get
   invalidated, but I was expecting cache warming to help though it
 doesn't
   appear to be.  We *don't* currently run a warming query, my impression
 of
   NRT was that it was better to not do that as otherwise you spend more
  time
   warming the searcher and caches, and by the time you've done all that,
  the
   searcher is invalidated anyway!
  
  
   On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote:
  
That's a good idea, I'll try that next week.
   
Thanks!
   
Tim
   
   
On 29/06/13 12:39 PM, Erick Erickson wrote:
   
Tim:
   
Yeah, this doesn't make much sense to me either since,
as you say, you should be seeing some metrics upon
occasion. But do note that the underlying cache only gets
filled when getting documents to return in query results,
since there's no autowarming going on it may come and
go.
   
But you can test this pretty quickly by lengthening your
autocommit interval or just not indexing anything
for a while, then run a bunch of queries and look at your
cache stats. That'll at least tell you whether it works at all.
You'll have to have hard commits turned off (or openSearcher
set to 'false') for that check too.
   
Best
Erick
   
   
On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim
   tvaillanco...@ea.com wrote:
   
 Yes, we are softCommit'ing every 1000ms, but that should be enough
  time
to
see metrics though, right? For example, I still get non-cumulative
metrics
from the other caches (which are also throw away). I've also
   curl/sampled
enough that I probably should have seen a value by now.
   
If anyone else can reproduce this on 4.3.1 I will feel less crazy
 :).
   
Cheers,
   
Tim
   
-Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, June 29, 2013 10:13 AM
To: 

Re: Converting nested data model to solr schema

2013-07-02 Thread adfel70
As you see it, does SOLR-3076 fix my problem?

Is the SOLR-3076 fix getting into Solr 4.4?


Mikhail Khludnev wrote
 On Mon, Jul 1, 2013 at 5:56 PM, adfel70 <adfel70@...> wrote:
 
 This requires me to override the Solr document distribution mechanism.
 I fear that with this solution I may lose some of Solr Cloud's
 capabilities.

 
 It's not clear whether you aware of
 http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what
 you
 did doesn't sound scary to me. If it works, it should be fine. I'm not
 aware of any capabilities that you are going to lose.
 Obviously SOLR-3076 provides astonishing query time performance, with
 offloading actual join work into index time. Check it if you current
 approach turns slow.
 
 
 -- 
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 <http://www.griddynamics.com>
 <mkhludnev@...>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074668.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.3 Pivot Performance Issue

2013-07-02 Thread Jack Krupansky

What is the nature of your degradation?

-- Jack Krupansky

-Original Message- 
From: solrUserJM

Sent: Tuesday, July 02, 2013 4:22 AM
To: solr-user@lucene.apache.org
Subject: Solr 4.3 Pivot Performance Issue

Hi There,

I noticed with the upgrade from Solr 4.0 to Solr 4.3 that we had a
degradation of queries that use pivot fields. Has anyone else noticed
it too?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-3-Pivot-Performance-Issue-tp4074617.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: need distance in miles not in kilometers

2013-07-02 Thread Jack Krupansky

Simply multiply by the number of miles per kilometer, 0.621371:

fl=_dist_:mul(geodist(),0.621371)

-- Jack Krupansky

-Original Message- 
From: irshad siddiqui

Sent: Tuesday, July 02, 2013 5:19 AM
To: solr-user@lucene.apache.org
Subject: need distance in miles not in kilometers

Hi,


I am using Solr 4.2 and my results are coming through properly.

But now I want the distance in miles, and I am getting the distance
in kilometre.

Can anyone tell me how to get the distance in miles.

example query

q=*:*&fq={!geofilt}&sfield=latlng&pt=18.9322453,72.8264378001&d=60&fl=_dist_:geodist()&sort=geodist() desc


url
http://wiki.apache.org/solr/SpatialSearch


Thanks in advance.

Regards,
Irshad 



Re: need distance in miles not in kilometers

2013-07-02 Thread irshad siddiqui
Jack ,
Thanks for your response.
In the case of frange we do not want to multiply separately for the conversion, so
in that case is there any way to convert it into miles?

My query:

http://localhost:8983/solr/select?q=name:shop&fl=name,shopLocation,shopMaxDeliveryDistance,geodist(shopLocation,0.0,0.0)&sort=geodist(shopLocation,0.0,0.0) asc&fq={!frange u=0}sub(geodist(shopLocation,0.0,0.0),shopMaxDeliveryDistance)

I want the result in miles.
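
One hedged way to keep the frange comparison itself in miles, building on Jack's mul() conversion (this assumes shopMaxDeliveryDistance is stored in miles):

fq={!frange u=0}sub(mul(geodist(shopLocation,0.0,0.0),0.621371),shopMaxDeliveryDistance)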






On Tue, Jul 2, 2013 at 6:11 PM, Jack Krupansky j...@basetechnology.com wrote:

 Simply multiply by the number of miles per kilometer, 0.621371:

  fl=_dist_:mul(geodist(),0.621371)

 -- Jack Krupansky

 -Original Message- From: irshad siddiqui
 Sent: Tuesday, July 02, 2013 5:19 AM
 To: solr-user@lucene.apache.org
 Subject: need distance in miles not in kilometers


 Hi,


  I am using Solr 4.2 and my results are coming through properly.

  But now I want the distance in miles, and I am getting the distance
  in kilometre.

  Can anyone tell me how to get the distance in miles.

 example query

  q=*:*&fq={!geofilt}&sfield=latlng&pt=18.9322453,72.8264378001&d=60&fl=_dist_:geodist()&sort=geodist() desc


 url
  http://wiki.apache.org/solr/SpatialSearch


 Thanks in advance.

 Regards,
 Irshad



Re: documentCache not used in 4.3.1?

2013-07-02 Thread Daniel Collins
Cheers, its certainly something we might end up exploring.


On 2 July 2013 12:41, Erick Erickson erickerick...@gmail.com wrote:

 This takes some significant custom code, but...

 One strategy is to keep your commits relatively
 lengthy (depends on the ingest rate) and keep
  a side car index, either a small core or a
  RAMDirectory. Then at search time you somehow
  combine the two results. The somehow is a
  bit tricky since the scores may not be comparable.
 If you're sorting it's trivial, but what you describe
 doesn't sound like it's sorted as opposed to score.
 Or more accurately, it sounds like you're sorting
 by score.

 But none of that is worthwhile if you're getting
 good enough results as it stands.

 Best
 Erick


 On Mon, Jul 1, 2013 at 12:28 PM, Daniel Collins danwcoll...@gmail.com
 wrote:

  Regrettably, visibility is key for us :(  Documents must be searchable as
  soon as they have been indexed (or as near as we can make it).  Our old
  search system didn't do relevance sort, it was time-ordered (so it had a
  much simpler job) but it did have sub-second latency, and that is what is
  expected for its replacement (I know Solr doesn't like <1s currently, but
  we live in hope!).  Tried explaining that by doing relevance sort we are
  searching 100% of the collection, instead of the ~10%-20% a time-ordered
  sort did (it effectively sharded by date and only searched as far back as
  it needed to fill a page of results), but that tends to get blank looks
  from business. :)
 
  One of life's little challenges.
 
 
  On 1 July 2013 11:10, Erick Erickson erickerick...@gmail.com wrote:
 
   Daniel:
  
   Soft commits invalidate the top level caches, which include
   things like filterCache, queryResultCache etc. Various
   segment-level caches are NOT invalidated, but you really
   don't have a lot of control from the Solr level over those
   anyway.
  
   But yeah, the tension between caching a bunch of stuff
   for query speedups and NRT is still with us. Soft commits
   are much less expensive than hard commits, but not being
   able to use the caches as much is the price. You're right
   that with such frequent autocommits, autowarming
   probably is not worth the effort.
  
   The question I always ask is whether 1 second is really
   necessary. Or, more accurately, worth the price. Often
   it's not and lengthening it out significantly may be an option,
   but that's a discussion for you to have with your product
    manager <G>
  
   I have seen configurations that have a more frequent hard
   commit (openSearcher=false) than soft commit. The
   mantra is soft commits are about visibility, hard commits
   are about durability.
  
   FWIW,
   Erick
  
  
   On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins danwcoll...@gmail.com
   wrote:
  
We see similar results, again we softCommit every 1s (trying to get
 as
   NRT
as we can), and we very rarely get any hits in our caches.  As an
unscheduled test last week, we did shutdown indexing and noticed
 about
   80%
hit rate in caches (and average query time dropped from ~1s to
 100ms!)
   so I
think we are in the same position as you.
   
I appreciate with such a frequent soft commit that the caches get
invalidated, but I was expecting cache warming to help though it
  doesn't
appear to be.  We *don't* currently run a warming query, my
 impression
  of
NRT was that it was better to not do that as otherwise you spend more
   time
warming the searcher and caches, and by the time you've done all
 that,
   the
searcher is invalidated anyway!
   
   
On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com
 wrote:
   
 That's a good idea, I'll try that next week.

 Thanks!

 Tim


 On 29/06/13 12:39 PM, Erick Erickson wrote:

 Tim:

 Yeah, this doesn't make much sense to me either since,
 as you say, you should be seeing some metrics upon
 occasion. But do note that the underlying cache only gets
 filled when getting documents to return in query results,
 since there's no autowarming going on it may come and
 go.

 But you can test this pretty quickly by lengthening your
 autocommit interval or just not indexing anything
 for a while, then run a bunch of queries and look at your
 cache stats. That'll at least tell you whether it works at all.
 You'll have to have hard commits turned off (or openSearcher
 set to 'false') for that check too.

 Best
 Erick


 On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim
    tvaillanco...@ea.com wrote:

  Yes, we are softCommit'ing every 1000ms, but that should be
 enough
   time
 to
 see metrics though, right? For example, I still get
 non-cumulative
 metrics
 from the other caches (which are also throw away). I've also
curl/sampled
 enough that I probably should have seen a value by now.

 If 

Re: Converting nested data model to solr schema

2013-07-02 Thread Jack Krupansky
It sounds like 4.4 will have an RC next week, so the prospects for block 
join in 4.4 are kind of dim. I mean, such a significant feature should have 
more than a few days to bake before getting released. But... who knows what 
Yonik has planned!


-- Jack Krupansky

-Original Message- 
From: adfel70

Sent: Tuesday, July 02, 2013 7:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Converting nested data model to solr schema

As you see it, does SOLR-3076 fixes my problem?

Is SOLR-3076 fix getting into solr 4.4?


Mikhail Khludnev wrote

On Mon, Jul 1, 2013 at 5:56 PM, adfel70 <adfel70@...> wrote:


This requires me to override the Solr document distribution mechanism.
I fear that with this solution I may lose some of Solr Cloud's
capabilities.



It's not clear whether you aware of
http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what
you
did doesn't sound scary to me. If it works, it should be fine. I'm not
aware of any capabilities that you are going to lose.
Obviously SOLR-3076 provides astonishing query time performance, with
offloading actual join work into index time. Check it if you current
approach turns slow.


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@...>






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074668.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Converting nested data model to solr schema

2013-07-02 Thread adfel70
I'm not familiar with block join in Lucene. I've read a bit, and I just want
to make sure - do you think that when this ticket is released, it will solve
the current problem of Solr Cloud joins?

Also, can you elaborate a bit about your solution?







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074696.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spell check in SOLR

2013-07-02 Thread Shalin Shekhar Mangar
See http://wiki.apache.org/solr/SpellCheckComponent

On Tue, Jul 2, 2013 at 4:14 PM, Prathik Puthran
prathik.puthra...@gmail.com wrote:
 Hi,

 How can I configure Solr to provide corrections for misspelled words? If
 the query string is in the dictionary, Solr should not return any
 suggestions. But if the query string is not in the dictionary, Solr should
 return all possible corrected words in the dictionary which most likely
 could be the intended query string.

 Thanks,
 Prathik



-- 
Regards,
Shalin Shekhar Mangar.
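
By way of illustration, a minimal sketch of the kind of setup that page
describes (component, field and handler wiring are illustrative, not from
this thread). DirectSolrSpellChecker only suggests for terms it considers
misspelled - terms occurring in more of the index than maxQueryFrequency
are treated as correctly spelled and get no suggestions:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">name</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <!-- terms in more than 1% of documents are considered correct -->
      <float name="maxQueryFrequency">0.01</float>
    </lst>
  </searchComponent>

With the component attached to a request handler, a query such as

  http://localhost:8983/solr/select?q=iphnoe&spellcheck=true&spellcheck.count=5

returns suggestions only for the misspelled term.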


DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Andy Pickler
Solr 4.1.0

We've been using the DIH to pull data in from a MySQL database for quite
some time now.  We're now wanting to strip all the HTML content out of many
fields using the HTMLStripTransformer (
http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
 Unfortunately, while it seems to be working fine for top-level entities,
we can't seem to get it to work for sub-entities:

(not exact schema, reduced for example purposes)

<entity name="blocks" dataSource="database"
        transformer="HTMLStripTransformer" query="
  SELECT
    id as blockId,
    name as blockTitle,
    content as content
  FROM engagement_block
  ">
  <field column="content" stripHTML="true" />  *THIS WORKS!*
  <entity name="blockReplies" dataSource="database"
          transformer="HTMLStripTransformer" query="
    SELECT
      br.other_content AS replyContent
    FROM block_reply
    ">
    <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
  </entity>
</entity>

We've tried several different permutations of putting the sub-entity column
in different nest levels of the XML to no avail.  I'm curious if we're
trying something that is just not supported or whether we are just trying
the wrong things.

Thanks,
Andy Pickler


Solr - working with delta import and cache

2013-07-02 Thread Mysurf Mail
I have two entities in 1:n relation - PackageVersion and Tag.
I have configured DIH to use CachedSqlEntityProcessor and everything works
as planned.
First, Tag entity is selected using the query attribute. Then the main
entity.
Ultra Fast.

Now I am adding the delta import.
Everything runs and loads, but too slow.
Looking at the db profiler output I see:

   1. The delta query of the inner entities runs first - which is good.
   2. The delta query of the main entity runs later - which is still good.
   3. The deltaImportQuery of the main entity runs as a separate single
      select for each of the IDs. Could this be improved by one select
      using "where ... in" over all the IDs? Is that possible?
   4. All of the query attributes of the other tables are run now. This is
      bad (in real life I have more than one table in a 1:n connection).
      For instance I get a lot of

      select ResourceId, [Text] PackageTag
      from [dbo].[Tag] Tag
      where ResourceType = 0

      runs. Because this comes from the query attribute, there is no where
      clause restricting it to the changed IDs.

a. How can I fix it?
b. Can I translate the deltaImportQuery to use "where ... in"?
c. There is no real order to all the selects when requesting a delta
import. Is it possible to use the caching when updating via delta as well?

Here is my configuration

<entity name="PackageVersion" pk="PackageVersionId"
        query="select ...
               from [dbo].[Package] Package inner join
               [dbo].[PackageVersion] PackageVersion
                 on Package.Id = PackageVersion.PackageId"
        deltaQuery="select PackageVersion.Id PackageVersionId
                    from [dbo].[Package] Package inner join
                    [dbo].[PackageVersion] PackageVersion
                      on Package.Id = PackageVersion.PackageId
                    where Package.LastModificationTime > '${dataimporter.last_index_time}'
                       or PackageVersion.Timestamp > '${dih.last_index_time}'"
        deltaImportQuery="select ...
                          from [dbo].[Package] Package inner join
                          [dbo].[PackageVersion] PackageVersion
                            on Package.Id = PackageVersion.PackageId
                          where PackageVersion.Id = '${dih.delta.PackageVersionId}'">

  <entity name="PackageTag" pk="ResourceId"
          processor="CachedSqlEntityProcessor" cacheKey="ResourceId"
          cacheLookup="PackageVersion.PackageId"
          query="select ResourceId, [Text] PackageTag
                 from [dbo].[Tag] Tag
                 where ResourceType = 0"
          deltaQuery="select ResourceId, [Text] PackageTag
                      from [dbo].[Tag] Tag
                      where ResourceType = 0
                        and Tag.TimeStamp > '${dih.last_index_time}'"
          parentDeltaQuery="select PackageVersion.PackageVersionId
                            from [dbo].[Package]
                            where Package.Id = ${PackageTag.ResourceId}"/>
</entity>


Solr cloud date based paritioning

2013-07-02 Thread kowish.adamosh
Hi guys!

I have a simple use case to implement but I have a problem with date-based
partitioning... Here are some rules:

1. At the beginning I have to create a huge index (>10GB) based on one db
table.
2. Every day I have to update this index.
3. 99.999% of queries are based on a date field (*data from the last 2
months*).

So my idea was to create partitions by month and provide the month-based
partitions in the query, like in the example in the documentation:
http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001

I would provide shards only from the last 2 months to gain nice performance.

The questions are: how can I create month-based partitions? Is it possible
to create a new shard for each new month and update delta data only to that
shard?

Examples are very welcome. I read documentation a few times and can't find
answers...

Thanks!
Kowish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-cloud-date-based-paritioning-tp4074729.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - working with delta import and cache

2013-07-02 Thread Mysurf Mail
BTW: I just found out that a delta import is only supported by the
SqlEntityProcessor.
Does it matter that I defined processor="CachedSqlEntityProcessor"?






How to disable debug in Solrj

2013-07-02 Thread Jean-Pierre Lauris
Hi,
I'm running the jetty start.jar and I'm indexing documents with
Solrj's HttpSolrServer object :

SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
server.add(docs);
server.commit();

This leads to TONS of debug information (i.e. logs at level DEBUG), on
both the server and client sides (but much more on the client side).
I've read and tried the methods suggested in:
http://wiki.apache.org/solr/SolrLogging#Customizing_Logging
http://wiki.apache.org/solr/LoggingInDefaultJettySetup

but nothing changed.
How can I lower the debugging level to INFO or WARN?
Thanks,
Scott.
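
A hedged sketch of one way to quiet this, assuming the client is using the
slf4j-log4j12 binding described on those wiki pages (a log4j.properties on
the classpath; names and pattern are illustrative):

  # Default everything to WARN...
  log4j.rootLogger=WARN, stdout
  # ...but keep Solr's own messages at INFO.
  log4j.logger.org.apache.solr=INFO
  log4j.appender.stdout=org.apache.log4j.ConsoleAppender
  log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
  log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] %m%n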


Re: Solr cloud date based paritioning

2013-07-02 Thread Gora Mohanty
On 2 July 2013 20:05, kowish.adamosh kowish.adam...@gmail.com wrote:
 Hi guys!

 I have simple use case to implement but I have problem with date based
 partitioning... Here are some rules:

 1. At the beginning I have to create huge index (10GB) based on one db
 table.
 2. Every day I have to update this index.
 3. 99,999% are queries based on date field (*data from last 2 months*).
[...]

Before you start complicating things, have you measured the
performance of having everything in one shard? It is quite
likely that a 10GB index would have adequate performance
on reasonable hardware. Your mileage may vary, but I would
try to measure the performance from a single index first.

Regards,
Gora


Re: Using per-segment FieldCache or DocValues in custom component?

2013-07-02 Thread Robert Muir
Where do you get the docid from? Usually it's best to just look at the whole
algorithm, e.g. docids come from per-segment readers by default anyway, so
ideally you want to access any per-document things from that same
segment reader.

As far as supporting docvalues, the FieldCache API passes through to
docvalues transparently if it's enabled for the field.
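
As a rough sketch of the per-segment pattern being described (Lucene 4.x
API; the class and field names are made up for illustration):

  import java.io.IOException;
  import java.util.List;
  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ReaderUtil;
  import org.apache.lucene.search.FieldCache;

  public class PerSegmentLookup {
    /** Resolves a top-level docId to its segment, then reads the value
        from that segment's FieldCache entry (which passes through to
        DocValues if the field has them enabled). */
    public static long getLongValue(IndexReader topReader, int docId,
        String field) throws IOException {
      List<AtomicReaderContext> leaves = topReader.leaves();
      // Find the leaf (segment) containing this top-level docId.
      AtomicReaderContext leaf = leaves.get(ReaderUtil.subIndex(docId, leaves));
      FieldCache.Longs longs =
          FieldCache.DEFAULT.getLongs(leaf.reader(), field, false);
      // Convert the top-level docId to a segment-local one.
      return longs.get(docId - leaf.docBase);
    }
  }

Because each cache entry is keyed on a segment reader, reopening the
searcher only recomputes entries for new segments.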

On Mon, Jul 1, 2013 at 4:55 PM, Michael Ryan mr...@moreover.com wrote:

 I have some custom code that uses the top-level FieldCache (e.g.,
 FieldCache.DEFAULT.getLongs(reader, foobar, false)). I'd like to redesign
 this to use the per-segment FieldCaches so that re-opening a Searcher is
 fast(er). In most cases, I've got a docId and I want to get the value for a
 particular single-valued field for that doc.

 Is there a good place to look to see example code of per-segment
 FieldCache use? I've been looking at PerSegmentSingleValuedFaceting, but
 hoping there might be something less confusing :)

 Also thinking DocValues might be a better way to go for me... is there any
 documentation or example code for that?

 -Michael



Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Gora Mohanty
On 2 July 2013 20:29, Andy Pickler andy.pick...@gmail.com wrote:
 Solr 4.1.0

 We've been using the DIH to pull data in from a MySQL database for quite
 some time now.  We're now wanting to strip all the HTML content out of many
 fields using the HTMLStripTransformer (
 http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
  Unfortunately, while it seems to be working fine for top-level entities,
 we can't seem to get it to work for sub-entities:

 (not exact schema, reduced for example purposes)

Please do not do that. This DIH configuration file does
not make sense (please see comments below), and we
are left guessing in the dark. If the file is too large,
you can share it on something like pastebin.com

  <entity name="blocks" dataSource="database"
          transformer="HTMLStripTransformer" query="
    SELECT
      id as blockId,
      name as blockTitle,
      content as content
    FROM engagement_block
    ">
    <field column="content" stripHTML="true" />  *THIS WORKS!*
    <entity name="blockReplies" dataSource="database"
            transformer="HTMLStripTransformer" query="
      SELECT
        br.other_content AS replyContent
      FROM block_reply
      ">
      <field column="other_content" stripHTML="true" /> *THIS DOESN'T WORK!*
[...]

(a) You SELECT replyContent, but the column attribute
 in the field is named other_content. Nothing should
 be getting indexed into the field.
(b) Why are your entities nested if the inner entity has no
 relationship to the outer one?

Regards,
Gora


Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Andy Pickler
Thanks for the quick reply.  Unfortunately, I don't believe my company
would want me sharing our exact production schema in a public forum,
although I realize it makes it harder to diagnose the problem.  The
sub-entity is a multi-valued field that indeed does have a relationship to
the outer entity.  I just left off the 'where' clause from the sub-entity,
as I didn't believe it was helpful in the context of this problem.  We use
the convention of..

SELECT dbColumnName AS solrFieldName

...so that we can relate the database column name to what we want it to be
named in the Solr index.

I don't think any of this helps you identify my problem, but I tried to
address your questions.

Thanks,
Andy

On Tue, Jul 2, 2013 at 9:14 AM, Gora Mohanty g...@mimirtech.com wrote:

 On 2 July 2013 20:29, Andy Pickler andy.pick...@gmail.com wrote:
  Solr 4.1.0
 
  We've been using the DIH to pull data in from a MySQL database for quite
  some time now.  We're now wanting to strip all the HTML content out of
 many
  fields using the HTMLStripTransformer (
  http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer).
   Unfortunately, while it seems to be working fine for top-level
 entities,
  we can't seem to get it to work for sub-entities:
 
  (not exact schema, reduced for example purposes)

 Please do not do that. This DIH configuration file does
 not make sense (please see comments below), and we
 are left guessing in the dark. If the file is too large,
 you can share it on something like pastebin.com

  entity name=blocks dataSource=database
  transformer=HTMLStripTransformer query=
SELECT
  id as blockId,
  name as blockTitle,
  content as content
FROM engagement_block

field column=content stripHTML=true /  *THIS WORKS!*
entity name=blockReplies dataSource=database
  transformer=HTMLStripTransformer query=
  SELECT
br.other_content AS replyContent
  FROM block_reply
  
  field column=other_content stripHTML=true / *THIS DOESN'T
 WORK!*
 [...]

 (a) You SELECT replyContent, but the column attribute
  in the field is named other_content. Nothing should
  be getting indexed into the field.
 (b) Why are your entities nested if the inner entity has no
  relationship to the outer one?

 Regards,
 Gora



Re: Solr indexer and Hadoop

2013-07-02 Thread Michael Della Bitta
Yes, I've read directly from NFS.

Consider the case where your mapper takes as input a list of the file paths
to operate on. Your mapper would load each file one by one by using
standard java.io.* calls, build a SolrInputDocument out of each one, and
submit it to a SolrServer implementation stored as a member field in the
mapper during the setup call. Something like this:

https://gist.github.com/mdellabitta/5910253

I literally wrote that in the git editor just now, so I don't even know if
it compiles, but you can get the idea. Note that the NFS mount has to be
live on all of the task nodes. Also, if the number of lines in the input
file is small enough, Hadoop might not split it enough for you, so you
should use NLineInputFormat. And you should definitely tune the number of
running tasks to make sure that you don't destroy your Solr box with lots
of traffic.

I've used the patch that Anatoli mentions as well, and that does work.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Tue, Jul 2, 2013 at 3:17 AM, engy.morsy engy.mo...@bibalex.org wrote:

 Michael,

 I understand from your post that I can use the current storage without
 moving it into Hadoop. I already have the storage mounted via NFS.
 Does your map function read from the mounted storage directly? If possible
 can you please illustrate more on that.

 Thanks
 Engy



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4074604.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Newbie SolR - Need advice

2013-07-02 Thread Jack Krupansky

Start with the Solr Tutorial.

http://lucene.apache.org/solr/tutorial.html

-- Jack Krupansky

-Original Message- 
From: fabio1605

Sent: Tuesday, July 02, 2013 11:16 AM
To: solr-user@lucene.apache.org
Subject: Newbie SolR - Need advice

Hi

we have an MSSQL Server which is just getting far too large now and
performance is dying! The majority of our webservers are mainly doing search
functions, so I thought it may be best to move to SolR, but I know very
little about it!

My questions are!

Does SolR run as a bolt-on to MSSQL - as in, the data is still in MSSQL and
SolR is just the search bit in between?

I'm really struggling to understand the point of SOLR etc., so if someone
could point me to a Dummies website I'd appreciate it! Google is throwing
too much confusion at me!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: OOM killer script woes

2013-07-02 Thread Mark Miller
Please file a JIRA issue so that we can address this.

- Mark

On Jul 2, 2013, at 6:20 AM, Daniel Collins danwcoll...@gmail.com wrote:

 On looking at the code in SolrDispatchFilter, is this intentional or not?
 I think I remember Mark Miller mentioning that in an OOM case, the best
 course of action is basically to kill the process; there is very little
 Solr can do once it has run out of memory.  Yet it seems that Solr catches
 the OOM itself and just logs it as an error, rather than letting it go back
 up to the JVM.
 
 We have also seen OOMs in IndexWriter, and that has specific code to handle
 OOM cases, and seems to fall back to the transaction log (but fails to
 commit anything).  I understand the logic of that, but in reality, I've
 seen that the tlog can get corrupted in this case, so we still need to be
 monitoring the system and forcibly kill the process.
 
 
 
 On 27 June 2013 00:03, Timothy Potter thelabd...@gmail.com wrote:
 
 Thanks for the feedback Daniel ... For now, I've opted to just kill
 the JVM with System.exit(1) in the SolrDispatchFilter code and will
 restart it with a Linux supervisor. Not elegant but the alternative of
 having a zombie Solr instance walking around my cluster is much worse
 ;-) Will try to dig into the code that is trapping this error but for
 now I've lost too many hours on this problem.
 
 Cheers,
 Tim
 
 On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins danwcoll...@gmail.com
 wrote:
 Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and
 throwing it/packaging it as a java.lang.RuntimeException.  The -XX option
 assumes that the application doesn't handle the Errors and so they would
 reach the JVM and thus invoke the handler.
 Since Jetty has an exception handler that is dealing with anything
 (included Errors), they never reach the JVM, hence no handler.
 
 Not much we can do short of not using Jetty?
 
 That's a pain, I'd just written a nice OOM handler too!
 
 
 On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote:
 
 A little more to this ...
 
 Just on chance this was a weird Jetty issue or something, I tried with
 the latest 9 and the problem still occurs :-(
 
 This is on Java 7 on debian:
 
 java version 1.7.0_21
 Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
 Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
 
 Here is an example stack trace from the log
 
 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR
 solr.servlet.SolrDispatchFilter Q:22 -
 null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
 at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
 at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
 at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
 at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
 at org.eclipse.jetty.server.Server.handle(Server.java:445)
 at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
 at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
 at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
 at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.lang.OutOfMemoryError: Java heap space
 
 On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com
 wrote:
 Recently upgraded to 4.3.1 but this problem has persisted for a while
 now ...
 
 I'm using the following configuration when starting Jetty:
 
 -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
 
 If an OOM is triggered during Solr web app initialization (such as by
 me lowering -Xmx to a value that is too low to initialize Solr with),
 then the 

Re: Unique key error while indexing pdf files

2013-07-02 Thread Shalin Shekhar Mangar
See http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor

The implicit fields generated by the FileListEntityProcessor are
fileDir, file, fileAbsolutePath, fileSize, fileLastModified, and these
are available for use within the entity.
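
For example, a hedged data-config sketch along those lines (paths and field
names are illustrative; it assumes an "id" uniqueKey, a "text" field in the
schema, and the Tika extraction jars on the classpath):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/pdfs" fileName=".*\.pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- the implicit fileAbsolutePath becomes the unique key -->
      <field column="fileAbsolutePath" name="id"/>
      <entity name="pdf" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>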

On Tue, Jul 2, 2013 at 2:47 PM, archit2112 archit2...@gmail.com wrote:
 Yes. The absolute path is unique. How do i implement it? can you please
 explain?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


RE: Newbie SolR - Need advice

2013-07-02 Thread David Quarterman
Hi Fabio,

Like Jack says, try the tutorial. But to answer your question, SOLR isn't a 
bolt on to SQLServer or any other DB. It's a fantastically fast 
indexing/searching tool. You'll need to use the DataImportHandler (see the 
tutorial) to import your data from the DB into the indices that SOLR uses. Once 
in there, you'll have more power  flexibility than SQLServer would ever give 
you!

Haven't tried SOLR on Windows (I guess your environment) but I'm sure it'll 
work using Jetty or Tomcat as web container.

Stick with it. The ride can be bumpy but the experience is sensational!

DQ



RE: Newbie SolR - Need advice

2013-07-02 Thread fabio1605
Thanks guys

So SolR is actually a database replacement for mssql... Am I right?


We have a lot of perl scripts that contain lots of sql insert queries, etc.


How do we query the SolR database from scripts? I know I have a lot to
learn still, so excuse my ignorance.

Also... What is mongo and how does it compare?

I just don't understand how in 10 years of Web development I have never heard
of SolR till last week




Sent from Samsung Mobile




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074782.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr cloud date based paritioning

2013-07-02 Thread Otis Gospodnetic
Hi,

There is nothing automatic that I know of that will create shards (or
maybe you mean SolrCloud Collections?) every month.  You can do that
in your application, though, just create the Collection via the API.
You can make use of aliases to have something like last2months alias
point to your last 2 Collections.  You would shift this alias every
month after you create your new Collection.  Of course, right after
the shift, you would really be searching only 1 month's worth of data,
so you may want to allow searching across last 3 Collections instead,
optionally enforcing/limiting query to last 2 months based on document
date and a range query.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
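
As a rough sketch of what that could look like with the Collections API
(host, collection and alias names here are made up), assuming a Solr
version with alias support:

  # at the start of each month, create that month's collection:
  http://localhost:8983/solr/admin/collections?action=CREATE&name=coll_201307&numShards=1

  # then repoint the alias at the newest two collections:
  http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=last2months&collections=coll_201306,coll_201307

Queries sent to the "last2months" alias then fan out over just those two
collections.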





Re: How to re-index Solr & get term frequency within documents

2013-07-02 Thread Otis Gospodnetic
Hi Tony,

There is, you can do it with that SolrEntityProcessor I pointed out,
if you have all your fields stored in Solr.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
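
For concreteness, a minimal DIH sketch of that approach (URL and core name
are placeholders) - it reads every stored document out of the existing
index and feeds it back through the normal indexing chain:

  <dataConfig>
    <document>
      <entity name="reindex"
              processor="SolrEntityProcessor"
              url="http://localhost:8983/solr/oldcore"
              query="*:*"
              rows="500"
              fl="*"/>
    </document>
  </dataConfig>

Again, this only works if every field you need is stored.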





Re: Newbie SolR - Need advice

2013-07-02 Thread Sandeep Mestry
Hi Fabio,

No, Solr isn't the database replacement for MS SQL.
Solr is built on top of Lucene which is a search engine library for text
searches.

Solr in itself is not a replacement for any database as it does not support
any relational db features; however, as Jack and David mentioned, it's a fully
optimised search engine platform that can provide all search-related
features like faceting, highlighting etc.
Solr does not have a *database*. It stores the data in binary files called
indexes (http://lucene.apache.org/core/3_0_3/fileformats.html). These
indexes are populated with the data from the database. Solr provides
inbuilt functionality through the DataImportHandler component to get the data
and generate indexes.

When you say your web servers are mainly doing search functions, do you
mean it is a text search and you use queries with clauses such as 'like', 'in'
etc. (in addition to multiple joins) to get the results? Does the web
application need faceting? If yes, then solr can be your friend to get it
through.

Do remember that it always takes some time to get the new concepts from
understanding through to implementation. As David mentioned already, it
*is* going to be a bumpy ride at the start but *definitely* a sensational
one.

Good Luck,
Sandeep






set-based and other less common approaches to search

2013-07-02 Thread gilawem
Let's say I wanted to ask solr to find me any document that contains at least
100 out of some 300 search terms I give it. Can Solr do this out of the box? If
not, what kind of customization would it require?

Now let's say I want to further have the option to request that those terms a)
must show up within the same column of an excel spreadsheet, or b) are exact
matches (i.e. match on "search", but not "searched"), or c) occur in the exact
order that I specified, or d) occur contiguously and without any words in
between, or e) are made up of non-word elements such as "92228345" or
"SJA12334".

Can solr do any of these out of the box? If not, which of these tasks are
relatively easy to do with some custom code, and which are not?

Re: set-based and other less common approaches to search

2013-07-02 Thread Otis Gospodnetic
Hi,

Solr can do all of these.  There are phrase queries, queries where you
specify a field, the mm param for "minimum should match", etc.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
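
For instance, the 100-of-300 case might look roughly like this with
edismax (URL heavily abbreviated; q would carry all 300 terms):

  http://localhost:8983/solr/select?defType=edismax&q=term1 term2 ... term300&mm=100

while quoting the terms as a single phrase, q="term1 term2 term3", covers
the exact-order, no-words-in-between case.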





Tomcat Solr Server startup fails with FileNotFoundException

2013-07-02 Thread Murthy Perla
Hi All,

   I am a newbie to solr. I've accidentally deleted indexed
files (manually, using the rm -rf command) from the solr index folder on the
server. From then on, whenever I start my server it fails to start with a
FileNotFoundException. How can this be fixed quickly?

  I'd appreciate it if anyone can suggest a quick fix for this.

INFO: created /elevate: solr.SearchHandler
Jul 1, 2013 8:17:40 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException:
/solr/index/_bbx.fnm (No such file or directory)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1099)
 I am seeing below exception as well.
 Can you please help me with these 2 exceptions? Let me know if you
need any other details on this.


2013-07-01 20:18:00 TaskUtils$LoggingErrorHandler [ERROR] Unexpected error
occurred in scheduled task.
org.apache.solr.common.SolrException: Internal Server Error
Internal Server Error
request: http://servername:8080/solr/admin/ping?wt=javabinversion=2
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)

-- 
Thanks and Regards,
Murthy P D N S.


Re: Newbie SolR - Need advice

2013-07-02 Thread fabio1605
Hi

Ok I'm even more confused now... Sorry for even more stupid questions.

So if it's not a database replacement... where do we keep the database
then?

We have a website that is a documentation website that stores documents. It has
over 130 million records in one table and 50 million in 2 others, plus lots of
little tables.

Most searches are like searching on references or for customer information etc.

However, with so much information stored, MS SQL is starting to get slower.

We have approx 100 tables across 4 different databases.

So this is why I started to look at SolR.

Q1: if we used SolR, would we still use sql as well as SolR, or does SolR become
sql (speaking theoretically)?

Q2: if so... how do we move all the data across to SolR?

Q3: is SolR useful for what we need, or is sql the better option based on
our circumstances? 50 percent of our load is from a website... 50 percent is
from scripts adding the information to the site etc.

Sorry for the silly questions, I'm just getting really confused now


Sent from Samsung Mobile


Re: Newbie SolR - Need advice

2013-07-02 Thread Jack Krupansky
Consider DataStax Enterprise - it combines Cassandra for NoSql data storage 
with Solr for indexing - fully integrated.


http://www.datastax.com/

-- Jack Krupansky


Re: Newbie SolR - Need advice

2013-07-02 Thread Shawn Heisey

Solr is not really a database.  Solr 4.x has a lot of features that make
it function well in some limited NoSQL roles, but it's a search engine,
not a database.  It is a good idea to use the stored setting on your
Solr fields only for those fields that are required to fully display a
search result listing, then use your database as the canonical data
store for displaying full information for a single search result when
the user clicks on it.
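
A hedged schema.xml sketch of that idea (field names invented):

  <!-- enough to render a result listing -->
  <field name="id"    type="string"       indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <!-- searchable, but the full text is fetched from the database
       when the user opens a result -->
  <field name="body"  type="text_general" indexed="true" stored="false"/>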

Aside from letting you know that it's not a good idea to give Microsoft
your money, I can't really say anything bad about MSSQL.  If it's
working for you and your performance (aside from search) is good,
there's no real reason to move away from it as a data repository.

MongoDB is a NoSQL database.  That would be a candidate for replacing
MSSQL.  Whether or not it could actually replace it depends on your data
model.

Thanks,
Shawn



Re: Newbie SolR - Need advice

2013-07-02 Thread Walter Underwood
Solr is not a database and it does not handle SQL queries. --wunder


--
Walter Underwood
wun...@wunderwood.org





Re: Newbie SolR - Need advice

2013-07-02 Thread fabio1605
Arrfh, I see... So SolR is the search engine for a datastore. Is that what
mongo is... a datastore bit?


Sent from Samsung Mobile

 Original message 
From: Jack Krupansky-2 [via Lucene] 
ml-node+s472066n4074809...@n3.nabble.com 
Date: 02/07/2013  17:51  (GMT+00:00) 
To: fabio1605 fabio.to...@btinternet.com 
Subject: Re: Newbie SolR - Need advice 
 
Consider DataStax Enterprise - it combines Cassandra for NoSql data storage 
with Solr for indexing - fully integrated. 

http://www.datastax.com/

-- Jack Krupansky 

-Original Message- 
From: fabio1605 
Sent: Tuesday, July 02, 2013 12:44 PM 
To: [hidden email] 
Subject: Re: Newbie SolR - Need advice 

Hi 

Ok I'm even more confused now...  Sorry for even more stupid questions. 

So if it's not a database replacement  Where do we keep the database 
then. 

We have a website that is a documentation website that store documents.  It 
has over 130 million records in a table and 50 million in 2 other plus lots 
of little tables 

Most searches are like searching on references or for customer information 
etc. 

However with so much information stored ms sql is starting to get slower 

We have approx 100 tables across 4 different database 

So this is why I started to look at SolR 

Q1: if we used SolR, would we still use sql as well as SolR, or does SolR 
become sql (speaking theoretically)? 

Q2: if so... how do we move all the data across to SolR? 

Q3: is SolR useful for what we need, or is sql the better option based 
on our circumstances? 50 percent of our load is from a website... 50 
percent is from scripts adding the information to the site etc. 

Sorry for the silly questions, I'm just getting really confused now 


Sent from Samsung Mobile 

-------- Original message --------
From: Sandeep Mestry [via Lucene] 
[hidden email] 
Date: 02/07/2013  17:29  (GMT+00:00) 
To: fabio1605 [hidden email] 
Subject: Re: Newbie SolR - Need advice 

Hi Fabio, 

No, Solr isn't the database replacement for MS SQL. 
Solr is built on top of Lucene which is a search engine library for text 
searches. 

Solr in itself is not a replacement for any database as it does not support 
any relational db features, however as Jack and David mentioned its fully 
optimised search engine platform that can provide all search related 
features like faceting, highlighting etc. 
Solr does not have a *database*. It stores the data in binary files called 
indexes http://lucene.apache.org/core/3_0_3/fileformats.html. These 
indexes are populated with the data from the database. Solr provides an 
inbuilt functionality through DataImportHandler component to get the data 
and generate indexes. 

When you say, your web servers are mainly doing search function, do you 
mean it is a text search and you use queries with clauses such as 'like', 'in' 
etc. (in addition to multiple joins) to get the results? Does the web 
application need faceting? If yes, then solr can be your friend to get it 
through. 
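
For illustration only (the field names here are made up, not from the 
original schema), a text search plus a facet is a single HTTP request 
against Solr: 

http://localhost:8983/solr/select?q=body:%22customer+reference%22&facet=true&facet.field=doc_type 

The facet.field part returns counts per doc_type alongside the matching 
documents - the kind of thing that is painful to express with 'like' 
queries and joins in SQL. 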

Do remember that it always takes some time to get the new concepts from 
understanding through to implementation. As David mentioned already, it 
*is* going to be a bumpy ride at the start but *definitely* a sensational 
one. 

Good Luck, 
Sandeep 



On 2 July 2013 17:09, fabio1605 [hidden email] wrote: 

 Thanks guys 
 
 So SolR is actually a database replacement for mssql... Am I right? 
 
 
 We have a lot of perl scripts that contain lots of sql insert 
 queries, etc. 
 
 
 How do we query the SolR database from scripts? I know I have a lot to 
 learn still, so excuse my ignorance. 
 
 Also... What is mongo and how does it compare? 
 
 I just don't understand how in 10 years of Web development I have never 
 heard of SolR till last week 
 
 
 
 
 Sent from Samsung Mobile 
 
-------- Original message --------
 From: David Quarterman [via Lucene]  
 [hidden email] 
 Date: 02/07/2013  16:57  (GMT+00:00) 
 To: fabio1605 [hidden email] 
 Subject: RE: Newbie SolR - Need advice 
 
 Hi Fabio, 
 
 Like Jack says, try the tutorial. But to answer your question, SOLR isn't 
 a bolt on to SQLServer or any other DB. It's a fantastically fast 
 indexing/searching tool. You'll need to use the DataImportHandler (see the 
 tutorial) to import your data from the DB into the indices that SOLR uses. 
 Once in there, you'll have more power & flexibility than SQLServer would 
 ever give you! 
 
 Haven't tried SOLR on Windows (I guess your environment) but I'm sure 
 it'll work using Jetty or Tomcat as web container. 
 
 Stick with it. The ride can be bumpy but the experience is sensational! 
 
 DQ 
 
 -Original Message- 
 From: fabio1605 [mailto:[hidden email]] 
 Sent: 02 July 2013 16:16 
 To: [hidden email] 
 Subject: Newbie SolR - Need advice 
 
 Hi 
 
 we have a MSSQL Server which is just getting far too large now and 
 performance is dying! The majority of our webservers are mainly doing 
 search functions, so I thought it may be best to move to SolR. But I 

Re: Tomcat Solr Server startup fails with FileNotFoundException

2013-07-02 Thread Shawn Heisey
On 7/2/2013 9:39 AM, Murthy Perla wrote:
I am a newbie to solr. I've accidentally deleted indexed
 files (manually, using the rm -rf command) from the solr index folder on the server.
 Now, whenever I start my server, it fails to start with an FNF exception. How
 can this be fixed quickly?

I believe this happens when you delete files in the index directory but
don't delete the index directory itself.  Try removing the entire directory.
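
In other words, something like this (path illustrative; stop Solr/Tomcat
first, and Solr will recreate an empty index directory on startup):

# with the server stopped:
rm -rf /path/to/solr/data/index    # the directory itself, not just the files in it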

Thanks,
Shawn



RE: Newbie SolR - Need advice

2013-07-02 Thread David Quarterman
Don’t worry Fabio - nobody knows everything (apart from Hossman). Following on 
from Sandeep, to use SOLR, you extract the data from your MSSQL DB using the 
DataImportHandler and you can then query it, facet it, pivot it to your heart's 
content. And fast!

You can use almost anything to build the SOLR queries - Java & PHP being 
probably the most popular. There is a library for Perl, I think, but I've never tried it.
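
In practice, anything that can issue an HTTP request can query SOLR. A
typical select call looks like this (core name and field are illustrative),
and wt=json or wt=xml picks the response format your scripts parse:

http://localhost:8983/solr/collection1/select?q=reference:ABC123&wt=json&rows=10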

So, you keep your mssql database, you just don't use it for searches - that'll 
relieve some of the load. Searches then all go through SOLR & its Lucene 
indexes. If your various tables need SQL joins, you specify those in the 
DataImportHandler (DIH) config. That way, when SOLR indexes everything, it 
indexes the data the way you want to see it.

DIH handles the data export from mssql -> SOLR and it's not too difficult to 
set up. 

You imply you're adding (inserting) data. How much, how often? DIH has a delta 
import feature so you can add data on the fly to SOLR's indexes.
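
To make that concrete, here is a minimal sketch of a DIH data-config.xml for
SQL Server covering both full and delta import - the driver, connection
string, table and column names are all made up for illustration:

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://dbhost;databaseName=docs"
              user="solr" password="secret"/>
  <document>
    <entity name="doc"
            query="SELECT id, reference, customer, body FROM documents"
            deltaQuery="SELECT id FROM documents
                        WHERE updated &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, reference, customer, body
                              FROM documents WHERE id = '${dih.delta.id}'">
      <field column="id"        name="id"/>
      <field column="reference" name="reference"/>
      <field column="customer"  name="customer"/>
      <field column="body"      name="content"/>
    </entity>
  </document>
</dataConfig>

A full-import runs the query; a delta-import runs deltaQuery to find changed
ids and deltaImportQuery to re-fetch just those rows.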

Much of it comes down to the data model you have. My advice would be to try it 
and see. You will be pleasantly surprised!



-Original Message-
From: fabio1605 [mailto:fabio.to...@btinternet.com] 
Sent: 02 July 2013 17:10
To: solr-user@lucene.apache.org
Subject: RE: Newbie SolR - Need advice

Thanks guys

So SolR is actually a database replacement for mssql... Am I right? 


We have a lot of perl scripts that contain lots of sql insert queries, etc.


How do we query the SolR database from scripts? I know I have a lot to 
learn still, so excuse my ignorance. 

Also... What is mongo and how does it compare?

I just don't understand how in 10 years of Web development I have never heard of 
SolR till last week




Sent from Samsung Mobile

-------- Original message --------
From: David Quarterman [via Lucene] 
ml-node+s472066n4074772...@n3.nabble.com 
Date: 02/07/2013  16:57  (GMT+00:00) 
To: fabio1605 fabio.to...@btinternet.com 
Subject: RE: Newbie SolR - Need advice 
 
Hi Fabio, 

Like Jack says, try the tutorial. But to answer your question, SOLR isn't a 
bolt on to SQLServer or any other DB. It's a fantastically fast 
indexing/searching tool. You'll need to use the DataImportHandler (see the 
tutorial) to import your data from the DB into the indices that SOLR uses. Once 
in there, you'll have more power & flexibility than SQLServer would ever give 
you! 

Haven't tried SOLR on Windows (I guess your environment) but I'm sure it'll 
work using Jetty or Tomcat as web container. 

Stick with it. The ride can be bumpy but the experience is sensational! 

DQ 

-Original Message- 
From: fabio1605 [mailto:[hidden email]] 
Sent: 02 July 2013 16:16 
To: [hidden email] 
Subject: Newbie SolR - Need advice 

Hi 

we have a MSSQL Server which is just getting far too large now and performance 
is dying! The majority of our webservers are mainly doing search functions, so I 
thought it may be best to move to SolR. But I know very little about it! 

My questions are! 

Does SolR Run as a bolt on to MSSQL - as in the data is still in MSSQL and SolR 
is just the search bit between? 

I'm really struggling to understand the point of SOLR etc, so if someone could 
point me to a Dummies website I'd appreciate it! Google is throwing too much 
confusion at me! 



-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746.html
Sent from the Solr - User mailing list archive at Nabble.com. 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074782.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr large boolean filter

2013-07-02 Thread Roman Chyla
Hello @,

This thread 'kicked' me into finishing some long-past task of
sending/receiving a large boolean (bitset) filter. We have been using bitsets
with solr before, but now I sat down and wrote it as a qparser. The use
cases, as you have discussed are:

 - necessity to send a long list of ids as a query (where it is not
possible to do it the 'normal' way)
 - or filtering ACLs


It works in the following way:

  - external application constructs bitset and sends it as a query to solr
(q or fq, depends on your needs)
  - solr unpacks the bitset (translated bits into lucene ids, if
necessary), and wraps this into a query which then has the easy job of
'filtering' wanted/unwanted items

Therefore it is good only if you can search against something that is
indexed as integer (id's often are).

A simple benchmark shows acceptable performance, to send the bitset
(randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)

To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
(5+14+68ms)

But I haven't tested latency of sending it over the network and the query
performance, but since the query is very similar to MatchAllDocs, it is
probably very fast (and I know that sending many Mbs to Solr is fast as
well)

I know this is not exactly 'standard' solution, and it is probably not
something you want to see with hundreds of millions of docs, but people
seem to be doing 'not the right thing' all the time;)
So if you think this is something useful for the community, please let me
know. If somebody would be willing to test it, i can file a JIRA ticket.

Thanks!

Roman


The code, if no JIRA is needed, can be found here:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

839ms.  run
154ms.  Building random bitset indexSize=10000000 fill=0.5 --
Size=15054208,cardinality=3934477 highestBit=9999999
 25ms.  Converting bitset to byte array -- resulting array length=1250000
20ms.  Encoding byte array into base64 -- resulting array length=1680000
ratio=1.344
 62ms.  Compressing byte array with GZIP -- resulting array length=1218602
ratio=0.9748816
20ms.  Encoding gzipped byte array into base64 -- resulting string
length=1624804 ratio=1.2998432
 5ms.  Decoding gzipped byte array from base64
14ms.  Uncompressing decoded byte array
68ms.  Converting from byte array to bitset
 743ms.  running
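
For anyone wanting to reproduce the encode path, here is a minimal sketch (my
reconstruction for illustration, not the MontySolr code; java.util.Base64
needs Java 8):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.BitSet;
import java.util.zip.GZIPOutputStream;

public class BitsetFilterParam {
    public static void main(String[] args) throws IOException {
        BitSet bits = new BitSet(10_000_000);
        bits.set(17);                      // flag the doc ids to keep
        bits.set(42);
        byte[] raw = bits.toByteArray();   // bitset -> byte array
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);                 // gzip before transport
        }
        // base64 so the binary payload can travel as a q/fq parameter value
        String param = Base64.getEncoder().encodeToString(bos.toByteArray());
        System.out.println("payload chars: " + param.length());
    }
}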


On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com wrote:

 Not necessarily. If the auth tokens are available on some
 other system (DB, LDAP, whatever), one could get them
 in the PostFilter and cache them somewhere since,
 presumably, they wouldn't be changing all that often. Or
 use a UserCache and get notified whenever a new searcher
 was opened and regenerate or purge the cache.

 Of course you're right if the post filter does NOT have
 access to the source of truth for the user's privileges.

 FWIW,
 Erick

 On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  The unfortunate thing about this is that you still have to *pass* that
  filter from the client to the server every time you want to use that
  filter.  If that filter is big/long, passing that in all the time has
  some price that could be eliminated by using server-side named
  filters.
 
  Otis
  --
   Solr & ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  You might consider post filters. The idea
  is to write a custom filter that gets applied
  after all other filters etc. One use-case
  here is exactly ACL lists, and can be quite
  helpful if you're not doing *:* type queries.
 
  Best
  Erick
 
  On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
  Btw. ElasticSearch has a nice feature here.  Not sure what it's
  called, but I call it named filter.
 
  http://www.elasticsearch.org/blog/terms-filter-lookup/
 
  Maybe that's what OP was after?
 
  Otis
  --
   Solr & ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
  arafa...@gmail.com wrote:
  On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com
 wrote:
  So I'm using query like
 
 http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
 
  If the IDs are purely numeric, I wonder if the better way is to send a
  bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
  is included. Even using URL-encoding rules, you can fit at least 65
  sequential ID flags per character and I am sure there are more
  efficient encoding schemes for long empty sequences.
 
  Regards,
 Alex.
 
 
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality 

Re: Concurrent Modification Exception

2013-07-02 Thread adityab
Anyone, any suggestions or pointers for this issue?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Concurrent-Modification-Exception-tp4074371p4074829.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: set-based and other less common approaches to search

2013-07-02 Thread gilawem
Thanks. So following up on a) below, could I set up and query Solr, without any 
customization of code, to match 10 of my given 20 terms, but only if it finds 
those 10 terms in an xls document under a column that is named "MyID" or "My 
ID" or "My I.D."? If so, what would that query look like?

On Jul 2, 2013, at 12:38 PM, Otis Gospodnetic wrote:

 Hi,
 
 Solr can do all of these.  There are phrase queries, queries where you
 specify a field, the mm param for "min should match", etc.
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Tue, Jul 2, 2013 at 12:36 PM, gilawem mewa...@gmail.com wrote:
 Let's say I wanted to ask solr to find me any document that contains at 
 least 100 out of some 300 search terms I give it. Can Solr do this out of 
 the box? If not, what kind of customization would it require?
 
 Now let's say I want to further have the option to request that those terms 
 a) must show up within the same column of an excel spreadsheet, or b) are 
 exact matches (i.e. match on "search", but not "searched"), or c) occur in 
 the exact order that I specified, or d) occur contiguously and without any 
 words in between, or e) are made up of non-word elements such as 92228345 
 or SJA12334.
 
 Can solr do any of these out of the box? If not, what of these tasks is 
 relatively easy to do with some custom code, and what is not?



Re: Solr large boolean filter

2013-07-02 Thread Roman Chyla
Wrong link to the parser, should be:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java


On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello @,

 This thread 'kicked' me into finishing some long-past task of
 sending/receiving large boolean (bitset) filter. We have been using bitsets
 with solr before, but now I sat down and wrote it as a qparser. The use
 cases, as you have discussed are:

  - necessity to send a long list of ids as a query (where it is not
 possible to do it the 'normal' way)
  - or filtering ACLs


 It works in the following way:

   - external application constructs bitset and sends it as a query to solr
 (q or fq, depends on your needs)
   - solr unpacks the bitset (translated bits into lucene ids, if
 necessary), and wraps this into a query which then has the easy job of
 'filtering' wanted/unwanted items

 Therefore it is good only if you can search against something that is
 indexed as integer (id's often are).

 A simple benchmark shows acceptable performance, to send the bitset
 (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)

 To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
 (5+14+68ms)

 But I haven't tested latency of sending it over the network and the query
 performance, but since the query is very similar as MatchAllDocs, it is
 probably very fast (and I know that sending many Mbs to Solr is fast as
 well)

 I know this is not exactly 'standard' solution, and it is probably not
 something you want to see with hundreds of millions of docs, but people
 seem to be doing 'not the right thing' all the time;)
 So if you think this is something useful for the community, please let me
 know. If somebody would be willing to test it, i can file a JIRA ticket.

 Thanks!

 Roman


 The code, if no JIRA is needed, can be found here:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

 839ms.  run
 154ms.  Building random bitset indexSize=10000000 fill=0.5 --
 Size=15054208,cardinality=3934477 highestBit=9999999
  25ms.  Converting bitset to byte array -- resulting array length=1250000
 20ms.  Encoding byte array into base64 -- resulting array length=1680000
 ratio=1.344
  62ms.  Compressing byte array with GZIP -- resulting array
 length=1218602 ratio=0.9748816
 20ms.  Encoding gzipped byte array into base64 -- resulting string
 length=1624804 ratio=1.2998432
  5ms.  Decoding gzipped byte array from base64
 14ms.  Uncompressing decoded byte array
 68ms.  Converting from byte array to bitset
  743ms.  running


 On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Not necessarily. If the auth tokens are available on some
 other system (DB, LDAP, whatever), one could get them
 in the PostFilter and cache them somewhere since,
 presumably, they wouldn't be changing all that often. Or
 use a UserCache and get notified whenever a new searcher
 was opened and regenerate or purge the cache.

 Of course you're right if the post filter does NOT have
 access to the source of truth for the user's privileges.

 FWIW,
 Erick

 On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  The unfortunate thing about this is that you still have to *pass* that
  filter from the client to the server every time you want to use that
  filter.  If that filter is big/long, passing that in all the time has
  some price that could be eliminated by using server-side named
  filters.
 
  Otis
  --
   Solr & ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson 
 erickerick...@gmail.com wrote:
  You might consider post filters. The idea
  is to write a custom filter that gets applied
  after all other filters etc. One use-case
  here is exactly ACL lists, and can be quite
  helpful if you're not doing *:* type queries.
 
  Best
  Erick
 
  On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
  Btw. ElasticSearch has a nice feature here.  Not sure what it's
  called, but I call it named filter.
 
  http://www.elasticsearch.org/blog/terms-filter-lookup/
 
  Maybe that's what OP was after?
 
  Otis
  --
   Solr & ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
  arafa...@gmail.com wrote:
  On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com
 wrote:
  So I'm using query like
 
 http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
 
  If the IDs are purely numeric, I wonder if the better way is to send
 a
  bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on 

How to show just the parent domains from results in Solr

2013-07-02 Thread A Geek
hi All, I've indexed documents in my Solr 4.0 index, with fields like URL, 
page_content etc. Now when I run a search query against the page_content I get 
a lot of urls. And say I have in total 15 URL domains, and under these 15 
domains I've all the pages indexed in SOLR. Is there a way in which I can 
just get the parent URLs for the search results instead of getting all the urls? 
For example, say searching for "abc" returns:
www.aa.com/11.html
www.aa.com/12.html
www.aa.com/13.html
www.bb.com/15.html
www.bb.com/18.html
I want the results to be like this:
www.aa.com
www.bb.com
Is there a way in SOLR through which I can achieve this? I've tried 
FieldCollapsing [ https://wiki.apache.org/solr/FieldCollapsing ] but either it's 
not the right solution or I'm not able to use it properly. Could someone help 
me find the solution to the above problem. Thanks in advance. 
Regards, KK

  

Re: Solr cloud date based paritioning

2013-07-02 Thread kowish.adamosh
Thanks!

I have a very limited response time (max 100ms), therefore sharding is a must.
Data also has a trend of growing up to tens of gigs.
Is there any way to create a new logical shard at runtime? I want to
logically partition my data by date. I'm still wondering how the example
from the documentation is implemented:

Query specific shard ids of the (implicit) collection. In this example, the
user has partitioned the index by date, creating a new shard every month:

http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001

Even in the first full load I don't know how to do it... In all examples I can
see that data are distributed physically by uniqueId % coreNum. Are there
some examples of a custom (i.e. date based) sharding strategy?

I can see that in JIRA: https://issues.apache.org/jira/browse/SOLR-2592
there is something that may help but I can't find anything in documentation.

Thanks for help!

Kowish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-cloud-date-based-paritioning-tp4074729p4074823.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr large boolean filter

2013-07-02 Thread Mikhail Khludnev
Hello Roman,

Have you considered passing the long id sequence as the body and accessing it
internally in solr as a content stream? That makes base64 encoding unnecessary.
AFAIK url length is limited somehow, anyway.


On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Wrong link to the parser, should be:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java


 On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Hello @,
 
  This thread 'kicked' me into finishing some long-past task of
  sending/receiving large boolean (bitset) filter. We have been using
 bitsets
  with solr before, but now I sat down and wrote it as a qparser. The use
  cases, as you have discussed are:
 
   - necessity to send a long list of ids as a query (where it is not
  possible to do it the 'normal' way)
   - or filtering ACLs
 
 
  It works in the following way:
 
- external application constructs bitset and sends it as a query to
 solr
  (q or fq, depends on your needs)
- solr unpacks the bitset (translated bits into lucene ids, if
  necessary), and wraps this into a query which then has the easy job of
  'filtering' wanted/unwanted items
 
  Therefore it is good only if you can search against something that is
  indexed as integer (id's often are).
 
  A simple benchmark shows acceptable performance, to send the bitset
  (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)
 
  To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
  (5+14+68ms)
 
  But I haven't tested latency of sending it over the network and the query
  performance, but since the query is very similar as MatchAllDocs, it is
  probably very fast (and I know that sending many Mbs to Solr is fast as
  well)
 
  I know this is not exactly 'standard' solution, and it is probably not
  something you want to see with hundreds of millions of docs, but people
  seem to be doing 'not the right thing' all the time;)
  So if you think this is something useful for the community, please let me
  know. If somebody would be willing to test it, i can file a JIRA ticket.
 
  Thanks!
 
  Roman
 
 
  The code, if no JIRA is needed, can be found here:
 
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
 
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
 
  839ms.  run
  154ms.  Building random bitset indexSize=10000000 fill=0.5 --
  Size=15054208,cardinality=3934477 highestBit=9999999
   25ms.  Converting bitset to byte array -- resulting array length=1250000
  20ms.  Encoding byte array into base64 -- resulting array length=1680000
  ratio=1.344
   62ms.  Compressing byte array with GZIP -- resulting array
  length=1218602 ratio=0.9748816
  20ms.  Encoding gzipped byte array into base64 -- resulting string
  length=1624804 ratio=1.2998432
   5ms.  Decoding gzipped byte array from base64
  14ms.  Uncompressing decoded byte array
  68ms.  Converting from byte array to bitset
   743ms.  running
 
 
  On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Not necessarily. If the auth tokens are available on some
  other system (DB, LDAP, whatever), one could get them
  in the PostFilter and cache them somewhere since,
  presumably, they wouldn't be changing all that often. Or
  use a UserCache and get notified whenever a new searcher
  was opened and regenerate or purge the cache.
 
  Of course you're right if the post filter does NOT have
  access to the source of truth for the user's privileges.
 
  FWIW,
  Erick
 
  On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
    The unfortunate thing about this is that you still have to *pass* that
   filter from the client to the server every time you want to use that
   filter.  If that filter is big/long, passing that in all the time has
   some price that could be eliminated by using server-side named
   filters.
  
   Otis
   --
    Solr & ElasticSearch Support
   http://sematext.com/
  
  
  
  
  
   On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson 
  erickerick...@gmail.com wrote:
   You might consider post filters. The idea
   is to write a custom filter that gets applied
   after all other filters etc. One use-case
   here is exactly ACL lists, and can be quite
   helpful if you're not doing *:* type queries.
  
   Best
   Erick
  
   On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
   otis.gospodne...@gmail.com wrote:
   Btw. ElasticSearch has a nice feature here.  Not sure what it's
   called, but I call it named filter.
  
   http://www.elasticsearch.org/blog/terms-filter-lookup/
  
   Maybe that's what OP was after?
  
   Otis
   --
    Solr & ElasticSearch Support
   http://sematext.com/
  
  
  
  
  
   On Mon, Jun 17, 2013 at 4:59 PM, Alexandre 

Re: set-based and other less common approaches to search

2013-07-02 Thread Mikhail Khludnev
try to hit dismax query parser specifying mm and qf parameters.


On Tue, Jul 2, 2013 at 9:31 PM, gilawem mewa...@gmail.com wrote:

 Thanks. So following up on a) below, could I set up and query Solr,
 without any customization of code, to match 10 of my given 20 terms, but
 only if it finds those 10 terms in an xls document under a column that is
 named MyID or My ID or My I.D.? If so, what would that query look
 like?

 On Jul 2, 2013, at 12:38 PM, Otis Gospodnetic wrote:

  Hi,
 
  Solr can do all of these.  There are phrase queries, queries where you
  specify a field, the mm param for "min should match", etc.
 
  Otis
  --
  Solr & ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Tue, Jul 2, 2013 at 12:36 PM, gilawem mewa...@gmail.com wrote:
  Let's say I wanted to ask solr to find me any document that contains at
 least 100 out of some 300 search terms I give it. Can Solr do this out of
 the box? If not, what kind of customization would it require?
 
  Now let's say I want to further have the option to request that those
 terms a) must show up within the same column of an excel spreadsheet, or b)
 are exact matches (i.e. match on "search", but not "searched"), or c) occur
 in the exact order that I specified, or d) occur contiguously and without
 any words in between, or e) are made up of non-word elements such as
 92228345 or SJA12334.
 
  Can solr do any of these out of the box? If not, what of these tasks is
 relatively easy to do with some custom code, and what is not?




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Converting nested data model to solr schema

2013-07-02 Thread Mikhail Khludnev
during indexing the whole block (doc and its attachments) goes into a particular
shard; it can then be queried on every shard and the results are merged.

btw, do you feel any problem with your current approach - query time joins
and out-of-the-box shard routing?


On Tue, Jul 2, 2013 at 5:19 PM, adfel70 adfe...@gmail.com wrote:

 I'm not familiar with block join in lucene. I've read a bit, and I just
 want
 to make sure - do you think that when this ticket is released, it will
 solve
 the current problem of solr cloud joins?

 Also, can you elaborate a bit about your solution?


 Jack Krupansky-2 wrote
  It sounds like 4.4 will have an RC next week, so the prospects for block
  join in 4.4 are kind of dim. I mean, such a significant feature should
  have
  more than a few days to bake before getting released. But... who knows
  what
  Yonik has planned!
 
  -- Jack Krupansky
 
  -Original Message-
  From: adfel70
  Sent: Tuesday, July 02, 2013 7:41 AM
  To:

  solr-user@.apache

  Subject: Re: Converting nested data model to solr schema
 
  As you see it, does SOLR-3076 fixes my problem?
 
  Is SOLR-3076 fix getting into solr 4.4?
 
 
  Mikhail Khludnev wrote
  On Mon, Jul 1, 2013 at 5:56 PM, adfel70 <adfel70@> wrote:
 
  This requires me to override the solr document distribution mechanism.
  I fear that with this solution I may loose some of solr cloud's
  capabilities.
 
 
  It's not clear whether you aware of
  http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but what
  you
  did doesn't sound scary to me. If it works, it should be fine. I'm not
  aware of any capabilities that you are going to loose.
  Obviously SOLR-3076 provides astonishing query time performance, with
  offloading actual join work into index time. Check it if you current
  approach turns slow.
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  <http://www.griddynamics.com>
   <mkhludnev@>
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074668.html
  Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074696.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


copyField and storage requirements

2013-07-02 Thread Ali, Saqib
Newbie question:

We have the following fields defined in the schema:

<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="teaser" type="text_general" indexed="false" stored="true"/>
<copyField source="content" dest="teaser" maxChars="80"/>

the content field is about 500KB of data.

My question is whether Solr stores the entire contents of that 500KB
content field?

We want to minimize the stored data in the Solr index, that is why we added
the copyField teaser.

Thanks
Saqib


Request to Edit Solr Wiki

2013-07-02 Thread Vivek Shivaprabhu
Hi

I'd like to contribute to some of the pages in the Solr Wiki at
wiki.apache.org/solr

My username is VivekShivaprabhu (alias: vivekrs)

Please do the needful. Thanks in advance!

-Vivek R S


Re: Request to Edit Solr Wiki

2013-07-02 Thread Erick Erickson
Done, added VivekShivaprabhu to the Solr contributor's group. Let us know
if you need the alias instead

And thanks for helping with the Wiki!

Erick


On Tue, Jul 2, 2013 at 1:42 PM, Vivek Shivaprabhu vivekrs@gmail.com wrote:

 Hi

 I'd like to contribute to some of the pages in the Solr Wiki at
 wiki.apache.org/solr

 My username is VivekShivaprabhu (alias: vivekrs)

 Please do the needful. Thanks in advance!

 -Vivek R S



Re: Two instances of solr - the same datadir?

2013-07-02 Thread Roman Chyla
as I discovered, it is not good to use the 'native' locktype in this scenario;
actually there is a note in the solrconfig.xml which says the same.

when a core is reloaded and solr tries to grab the lock, it will fail - even if
the instance is configured to be read-only, so I am using the 'single' lock for
the readers and 'native' for the writer, which seems to work OK

roman


On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote:

 I have auto commit after 40k recs/1800 secs. I have only tested with manual
 commit, but I don't see why it should work differently.
 Roman
 On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:

 If it makes you feel better, I also considered this approach when I was in
 the same situation with a separate indexer and searcher on one Physical
 linux machine.

 My main concern was re-using the FS cache between both instances - If I
 replicated to myself there would be two independent copies of the index,
 FS-cached separately.

 I like the suggestion of using autoCommit to reload the index. If I'm
 reading that right, you'd set an autoCommit on 'zero docs changing', or
 just 'every N seconds'? Did that work?

 Best of luck!

 Tim


 On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:

  So here it is for a record how I am solving it right now:
 
  Write-master is started with: -Dmontysolr.warming.enabled=false
  -Dmontysolr.write.master=true -Dmontysolr.read.master=
  http://localhost:5005
  Read-master is started with: -Dmontysolr.warming.enabled=true
  -Dmontysolr.write.master=false
 
 
  solrconfig.xml changes:
 
  1. all index changing components have this bit,
  enable="${montysolr.master:true}" - ie.

  <updateHandler class="solr.DirectUpdateHandler2"
   enable="${montysolr.master:true}">

  2. for cache warming de/activation

  <listener event="newSearcher"
    class="solr.QuerySenderListener"
    enable="${montysolr.enable.warming:true}">...

  3. to trigger refresh of the read-only-master (from write-master):

  <listener event="postCommit"
    class="solr.RunExecutableListener"
    enable="${montysolr.master:true}">
    <str name="exe">curl</str>
    <str name="dir">.</str>
    <bool name="wait">false</bool>
    <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
  </listener>
 
  This works, I still don't like the reload of the whole core, but it
 seems
  like the easiest thing to do now.
 
  -- roman
 
 
  On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Hi Peter,
  
   Thank you, I am glad to read that this usecase is not alien.
  
   I'd like to make the second instance (searcher) completely read-only,
 so
  I
   have disabled all the components that can write.
  
   (being lazy ;)) I'll probably use
   http://wiki.apache.org/solr/CollectionDistribution to call the curl
  after
   commit, or write some IndexReaderFactory that checks for changes
  
   The problem with calling the 'core reload' - is that it seems lots of
  work
   for just opening a new searcher, eeekkk...somewhere I read that it is
  cheap
   to reload a core, but re-opening the index searchers must be definitely
   cheaper...
  
   roman
  
  
   On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com
  wrote:
  
   Hi,
   We use this very same scenario to great effect - 2 instances using
 the
   same
   dataDir with many cores - 1 is a writer (no caching), the other is a
   searcher (lots of caching).
   To get the searcher to see the index changes from the writer, you
 need
  the
   searcher to do an empty commit - i.e. you invoke a commit with 0
   documents.
   This will refresh the caches (including autowarming), [re]build the
   relevant searchers etc. and make any index changes visible to the RO
   instance.
   Also, make sure to use <lockType>native</lockType> in solrconfig.xml
 to
   ensure the two instances don't try to commit at the same time.
   There are several ways to trigger a commit:
   Call commit() periodically within your own code.
   Use autoCommit in solrconfig.xml.
   Use an RPC/IPC mechanism between the 2 instance processes to tell the
   searcher the index has changed, then call commit when called (more
  complex
   coding, but good if the index changes on an ad-hoc basis).
   Note, doing things this way isn't really suitable for an NRT
  environment.
  
   HTH,
   Peter
  
  
  
   On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Replication is fine, I am going to use it, but I wanted it for
  instances
*distributed* across several (physical) machines - but here I have
 one
physical machine, it has many cores. I want to run 2 instances of
 solr
because I think it has these benefits:
   
1) I can give less RAM to the writer (4GB), and use more RAM for
 the
searcher (28GB)
2) I can deactivate warming for the writer and keep it for the
  searcher

Re: copyField and storage requirements

2013-07-02 Thread Shawn Heisey
On 7/2/2013 12:22 PM, Ali, Saqib wrote:
 Newbie question:
 
 We have the following fields defined in the schema:
 
 <field name="content" type="text_general" indexed="true" stored="false"/>
 <field name="teaser" type="text_general" indexed="false" stored="true"/>
 <copyField source="content" dest="teaser" maxChars="80"/>
 
 the content field is about 500KB of data.
 
 My question is whether Solr stores the entire contents of that 500KB
 content field?
 
 We want to minimize the stored data in the Solr index, that is why we added
 the copyField teaser.

With that config, the entire 500KB will not be _stored_ .. but it will
affect the index size because you are indexing it.  Exactly what degree
that will be depends on the definition of the text_general type.

Thanks,
Shawn



Re: Solr large boolean filter

2013-07-02 Thread Roman Chyla
Hello Mikhail,

Yes, GET is limited, but POST is not - so I just wanted that it works in
both the same way. But I am not sure if I am understanding your question
completely. Could you elaborate on the parameters/body part? Is there no
need for encoding of binary data inside the body? Or do you mean it is
treated as a string? Or is it just a bytestream and other parameters are
seen as string?

On a general note: my main concern was to send many ids fast. If we use
ints (32bit), in one MB one can fit ~250K of them, with a bitset 33 times more
(somebody check the numbers please :)). But certainly, if the bitset is sparse
or the collection of ids is just 'a few thousands', a stream of ints/longs will
be smaller and better to use.

roman



On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Hello Roman,

 Don't you consider to pass long id sequence as body and access internally
 in solr as a content stream? It makes base64 compression not necessary.
 AFAIK url length is limited somehow, anyway.


 On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Wrong link to the parser, should be:
 
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
 
 
  On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   Hello @,
  
   This thread 'kicked' me into finishing some long-past task of
   sending/receiving large boolean (bitset) filter. We have been using
  bitsets
   with solr before, but now I sat down and wrote it as a qparser. The use
   cases, as you have discussed are:
  
 - necessity to send a long list of ids as a query (where it is not
   possible to do it the 'normal' way)
- or filtering ACLs
  
  
   It works in the following way:
  
 - external application constructs bitset and sends it as a query to
  solr
   (q or fq, depends on your needs)
 - solr unpacks the bitset (translated bits into lucene ids, if
   necessary), and wraps this into a query which then has the easy job of
   'filtering' wanted/unwanted items
  
   Therefore it is good only if you can search against something that is
   indexed as integer (id's often are).
  
   A simple benchmark shows acceptable performance, to send the bitset
   (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)
  
   To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
   (5+14+68ms)
  
   But I haven't tested latency of sending it over the network and the
 query
   performance, but since the query is very similar as MatchAllDocs, it is
   probably very fast (and I know that sending many Mbs to Solr is fast as
   well)
  
   I know this is not exactly 'standard' solution, and it is probably not
   something you want to see with hundreds of millions of docs, but people
   seem to be doing 'not the right thing' all the time;)
   So if you think this is something useful for the community, please let
 me
   know. If somebody would be willing to test it, i can file a JIRA
 ticket.
  
   Thanks!
  
   Roman
  
  
   The code, if no JIRA is needed, can be found here:
  
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
  
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
  
   839ms.  run
   154ms.  Building random bitset indexSize=10000000 fill=0.5 --
   Size=15054208,cardinality=3934477 highestBit=9999999
    25ms.  Converting bitset to byte array -- resulting array
 length=1250000
   20ms.  Encoding byte array into base64 -- resulting array
 length=1680000
   ratio=1.344
62ms.  Compressing byte array with GZIP -- resulting array
   length=1218602 ratio=0.9748816
   20ms.  Encoding gzipped byte array into base64 -- resulting string
   length=1624804 ratio=1.2998432
5ms.  Decoding gzipped byte array from base64
   14ms.  Uncompressing decoded byte array
   68ms.  Converting from byte array to bitset
743ms.  running
  
  
   On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
  
   Not necessarily. If the auth tokens are available on some
   other system (DB, LDAP, whatever), one could get them
   in the PostFilter and cache them somewhere since,
   presumably, they wouldn't be changing all that often. Or
   use a UserCache and get notified whenever a new searcher
   was opened and regenerate or purge the cache.
  
   Of course you're right if the post filter does NOT have
   access to the source of truth for the user's privileges.
  
   FWIW,
   Erick
  
   On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
   otis.gospodne...@gmail.com wrote:
Hi,
   
    The unfortunate thing about this is that you still have to *pass*
 that
filter from the client to the server every time you want to use that
filter.  If that filter is big/long, passing that in all the time
 has
some price that 

Re: Two instances of solr - the same datadir?

2013-07-02 Thread Peter Sturge
Hmmm, single lock sounds dangerous. It probably works ok because you've
been [un]lucky.
For example, even with a RO instance, you still need to do a commit in
order to reload caches/changes from the other instance.
What happens if this commit gets called in the middle of the other
instance's commit? I've not tested this scenario, but it's very possible
with a 'single' lock the results are indeterminate.
If the 'single' lock mechanism is making assumptions e.g. no other process
will interfere, and then one does, the Lucene index could very well get
corrupted.
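
For reference, an empty commit can be issued against the RO instance with a
bare update request (host, port and core name illustrative):

http://localhost:8983/solr/collection1/update?commit=true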

For the error you're seeing using 'native', we use native lockType for both
write and RO instances, and it works fine - no contention.
Which version of Solr are you using? Perhaps there's been a change in
behaviour?

Peter


On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote:

 as I discovered, it is not good to use the 'native' locktype in this scenario;
 actually there is a note in the solrconfig.xml which says the same.

 when a core is reloaded and solr tries to grab the lock, it will fail - even if
 the instance is configured to be read-only, so I am using the 'single' lock for
 the readers and 'native' for the writer, which seems to work OK

 roman


 On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote:

  I have auto commit after 40k recs/1800 secs. I have only tested with manual
  commit, but I don't see why it should work differently.
  Roman
  On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:
 
  If it makes you feel better, I also considered this approach when I was
 in
  the same situation with a separate indexer and searcher on one Physical
  linux machine.
 
  My main concern was re-using the FS cache between both instances - If
 I
  replicated to myself there would be two independent copies of the index,
  FS-cached separately.
 
  I like the suggestion of using autoCommit to reload the index. If I'm
  reading that right, you'd set an autoCommit on 'zero docs changing', or
  just 'every N seconds'? Did that work?
 
  Best of luck!
 
  Tim
 
 
  On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:
 
   So here it is for a record how I am solving it right now:
  
   Write-master is started with: -Dmontysolr.warming.enabled=false
   -Dmontysolr.write.master=true -Dmontysolr.read.master=
   http://localhost:5005
   Read-master is started with: -Dmontysolr.warming.enabled=true
   -Dmontysolr.write.master=false
  
  
   solrconfig.xml changes:
  
   1. all index changing components have this bit,
   enable="${montysolr.master:true}" - ie.

   <updateHandler class="solr.DirectUpdateHandler2"
    enable="${montysolr.master:true}">

   2. for cache warming de/activation

   <listener event="newSearcher"
     class="solr.QuerySenderListener"
     enable="${montysolr.enable.warming:true}">...

   3. to trigger refresh of the read-only-master (from write-master):

   <listener event="postCommit"
     class="solr.RunExecutableListener"
     enable="${montysolr.master:true}">
     <str name="exe">curl</str>
     <str name="dir">.</str>
     <bool name="wait">false</bool>
     <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
   </listener>
  
   This works, I still don't like the reload of the whole core, but it
  seems
   like the easiest thing to do now.
  
   -- roman
  
  
   On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Hi Peter,
   
Thank you, I am glad to read that this usecase is not alien.
   
I'd like to make the second instance (searcher) completely
 read-only,
  so
   I
have disabled all the components that can write.
   
(being lazy ;)) I'll probably use
http://wiki.apache.org/solr/CollectionDistribution to call the curl
   after
commit, or write some IndexReaderFactory that checks for changes
   
The problem with calling the 'core reload' - is that it seems lots
 of
   work
for just opening a new searcher, eeekkk...somewhere I read that it
 is
   cheap
to reload a core, but re-opening the index searches must be
 definitely
cheaper...
   
roman
   
   
On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge 
 peter.stu...@gmail.com
   wrote:
   
Hi,
We use this very same scenario to great effect - 2 instances using
  the
same
dataDir with many cores - 1 is a writer (no caching), the other is
 a
searcher (lots of caching).
To get the searcher to see the index changes from the writer, you
  need
   the
searcher to do an empty commit - i.e. you invoke a commit with 0
documents.
This will refresh the caches (including autowarming), [re]build the
relevant searchers etc. and make any index changes visible to the
 RO
instance.
 Also, make sure to use <lockType>native</lockType> in
 solrconfig.xml
  to
ensure the two instances don't try to commit at the same time.

Re: Two instances of solr - the same datadir?

2013-07-02 Thread Roman Chyla
Interesting, we are running 4.0 - and solr will refuse to start (or
reload) the core. But from looking at the code I am not seeing that it is doing
any writing - but I should dig more...

Are you sure it needs to do writing? Because I am not calling commits, in
fact I have deactivated *all* components that write into index, so unless
there is something deep inside, which automatically calls the commit, it
should never happen.

roman


On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com wrote:

 Hmmm, single lock sounds dangerous. It probably works ok because you've
 been [un]lucky.
 For example, even with a RO instance, you still need to do a commit in
 order to reload caches/changes from the other instance.
 What happens if this commit gets called in the middle of the other
 instance's commit? I've not tested this scenario, but it's very possible
 with a 'single' lock the results are indeterminate.
 If the 'single' lock mechanism is making assumptions e.g. no other process
 will interfere, and then one does, the Lucene index could very well get
 corrupted.

 For the error you're seeing using 'native', we use native lockType for both
 write and RO instances, and it works fine - no contention.
 Which version of Solr are you using? Perhaps there's been a change in
 behaviour?

 Peter


 On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote:

  as i discovered, it is not good to use 'native' locktype in this
 scenario,
  actually there is a note in the solrconfig.xml which says the same
 
  when a core is reloaded and solr tries to grab lock, it will fail - even
 if
  the instance is configured to be read-only, so i am using 'single' lock
 for
  the readers and 'native' for the writer, which seems to work OK
 
  roman
 
 
  On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   I have auto commit after 40k RECs/1800secs. But I only tested with
 manual
   commit, but I don't see why it should work differently.
   Roman
   On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:
  
   If it makes you feel better, I also considered this approach when I
 was
  in
   the same situation with a separate indexer and searcher on one
 Physical
   linux machine.
  
   My main concern was re-using the FS cache between both instances -
 If
  I
   replicated to myself there would be two independent copies of the
 index,
   FS-cached separately.
  
   I like the suggestion of using autoCommit to reload the index. If I'm
   reading that right, you'd set an autoCommit on 'zero docs changing',
 or
   just 'every N seconds'? Did that work?
  
   Best of luck!
  
   Tim
  
  
   On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:
  
So here it is for a record how I am solving it right now:
   
Write-master is started with: -Dmontysolr.warming.enabled=false
-Dmontysolr.write.master=true -Dmontysolr.read.master=
http://localhost:5005
Read-master is started with: -Dmontysolr.warming.enabled=true
-Dmontysolr.write.master=false
   
   
solrconfig.xml changes:
   
 1. all index changing components have this bit,
 enable="${montysolr.master:true}" - ie.

 <updateHandler class="solr.DirectUpdateHandler2"
  enable="${montysolr.master:true}">

 2. for cache warming de/activation

 <listener event="newSearcher"
   class="solr.QuerySenderListener"
   enable="${montysolr.enable.warming:true}">...

 3. to trigger refresh of the read-only-master (from write-master):

 <listener event="postCommit"
   class="solr.RunExecutableListener"
   enable="${montysolr.master:true}">
   <str name="exe">curl</str>
   <str name="dir">.</str>
   <bool name="wait">false</bool>
   <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
 </listener>
   
This works, I still don't like the reload of the whole core, but it
   seems
like the easiest thing to do now.
   
-- roman
   
   
On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
 
wrote:
   
 Hi Peter,

 Thank you, I am glad to read that this usecase is not alien.

 I'd like to make the second instance (searcher) completely
  read-only,
   so
I
 have disabled all the components that can write.

 (being lazy ;)) I'll probably use
 http://wiki.apache.org/solr/CollectionDistribution to call the
 curl
after
 commit, or write some IndexReaderFactory that checks for changes

 The problem with calling the 'core reload' - is that it seems lots
  of
work
 for just opening a new searcher, eeekkk...somewhere I read that it
  is
cheap
 to reload a core, but re-opening the index searches must be
  definitely
 cheaper...

 roman


 On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge 
  peter.stu...@gmail.com
wrote:

 

Filter cache pollution during sharded edismax queries

2013-07-02 Thread Ken Krugler
Hi all,

After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had 
dropped significantly.

Previously it was at 95+%, but now it's < 50%.

I enabled recording 100 entries for debugging, and in looking at them it seems 
that edismax (and faceting) is creating entries for me.

This is in a sharded setup, so it's a distributed search.

If I do a search for the string "bogus text" using edismax on two fields, I get 
an entry in each of the shard's filter caches that looks like:

item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2):

Is this expected?

I have a similar situation happening during faceted search, even though my 
fields are single-value/untokenized strings, and I'm not using the enum facet 
method.

But I'll get many, many entries in the filterCache for facet values, and they 
all look like "item_facet field:facet value":

The net result of the above is that even with a very big filterCache size of 
2K, the hit ratio is still only 60%.

Thanks for any insights,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Replicating files containing external file fields

2013-07-02 Thread Arun Rangarajan
Jack and Erick,
Thanks for your replies. I am able to replicate ext file fields by
specifying the relative paths for each individual file. confFiles in
solrconfig.xml is really long now with lots of ../ as I've got 5 ext file
field files. It would be really nice if wild-cards were supported here :-).

About the reloadCache on slave: following
http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
I
set up listeners to reload the ext file fields after commits. Since the
slave replicationHandler issues a commit after it replicates the files (as
mentioned in
https://wiki.apache.org/solr/SolrReplication#How_does_the_slave_replicate.3F),
I believe the ext file fields get reloaded to the slave cache after
replication. This is exactly what I was looking for.
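
For anyone following along, such a confFiles entry ends up as a single
comma-separated list with one relative path per external file, roughly like
this (file names shortened for illustration):

<str name="confFiles">solrconfig.xml,schema.xml,../../../../solr-data/List/external_a,../../../../solr-data/List/external_b</str>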


On Fri, Jun 28, 2013 at 5:08 PM, Jack Krupansky j...@basetechnology.com wrote:

 Yes, you need to list that EFF file in the confFiles list - only those
 listed files will be replicated.

 <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>

 Oops... sorry, no wildcards... you must list the individual files.

 Technically, the path is supposed to be relative to the Solr collection
 conf directory, so you may have to put lots of ../ in the
 path to get to the files, like:

 ../../../../solr-data/List/external_1

 For each file.

 (This is what Erick was referring to.)

 Sorry, I don't have the answer to the reload question at the tip of my
 tongue.


 -- Jack Krupansky

 -Original Message- From: Arun Rangarajan
 Sent: Friday, June 28, 2013 7:42 PM

 To: solr-user@lucene.apache.org
 Subject: Re: Replicating files containing external file fields

 Jack,

 Here is the ReplicationHandler definition from solrconfig.xml:

 <requestHandler name="/replication" class="solr.ReplicationHandler">
 <lst name="master">
 <str name="enable">${enable.master:false}</str>
 <str name="replicateAfter">startup</str>
 <str name="replicateAfter">commit</str>
 <str name="replicateAfter">optimize</str>
 <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml</str>
 </lst>
 <lst name="slave">
 <str name="enable">${enable.slave:false}</str>
 <str name="masterUrl">http://${master.ip}:${master.port}/solr/${solr.core.name}/replication</str>
 <str name="pollInterval">00:01:00</str>
 </lst>
 </requestHandler>

 The confFiles are under the dir:
 /var/solr/application-cores/List/conf
 and the external file fields are like:
 /var/solr-data/List/external_*

 Should I add
 /var/solr-data/List/external_*
 to confFiles like this?

 <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>


 Also, can you tell me when (or whether) I need to do reloadCache on the
 slave after the ext file fields are replicated?

 Thx.


 On Fri, Jun 28, 2013 at 10:13 AM, Jack Krupansky <j...@basetechnology.com> wrote:

  Show us your confFiles directive. Maybe there is some subtle error in
 the file name.

 -- Jack Krupansky

 -Original Message- From: Arun Rangarajan
 Sent: Friday, June 28, 2013 1:06 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Replicating files containing external file fields


 Erick,
 Thx for your reply. The external file field fields are already under
 dataDir specified in solrconfig.xml. They are not getting replicated.
 (Solr version 4.2.1.)


 On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson <erickerick...@gmail.com> wrote:


  Haven't tried this, but I _think_ you can use the

 confFiles trick with relative paths, see:
  http://wiki.apache.org/solr/SolrReplication


 Or just put your EFF files in the data dir?

 Best
 Erick


 On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan <arunrangara...@gmail.com> wrote:

  From https://wiki.apache.org/solr/SolrReplication I understand that index
  dir and any files under the conf dir can be replicated to slaves. I 
 want
 to
  know if there is any way the files under the data dir containing 
 external
  file fields can be replicated. These are not replicated by default.
  Currently we are running the ext file field reload script on both the
  master and the slave and then running reloadCache on each server once
 they
  are loaded.
 







Re: Solr large boolean filter

2013-07-02 Thread Mikhail Khludnev
Roman,

It's covered in http://wiki.apache.org/solr/ContentStream
 | For POST requests where the content-type is not
application/x-www-form-urlencoded, the raw POST body is passed as a
stream.

So, there is no need for encoding of binary data inside the body.

Regarding encoding, I have a positive experience of passing such ids
encoded by vInt, but they need to be presorted.
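
(A minimal sketch of that encoding, using Lucene 4.x's variable-length int support; sorting first keeps the gaps, and therefore the vInts, small. Class and helper names here are illustrative:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import org.apache.lucene.store.OutputStreamDataOutput;

  public class VIntIdEncoder {
    /** Delta-encodes a presorted id list as vInts: small gaps take 1-2 bytes each. */
    public static byte[] encode(int[] sortedIds) throws IOException {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      OutputStreamDataOutput out = new OutputStreamDataOutput(bytes);
      out.writeVInt(sortedIds.length);
      int prev = 0;
      for (int id : sortedIds) {
        out.writeVInt(id - prev); // store the gap, not the absolute id
        prev = id;
      }
      return bytes.toByteArray();
    }
  }
)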



On Tue, Jul 2, 2013 at 10:46 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello Mikhail,

 Yes, GET is limited, but POST is not - so I just wanted it to work the same
 way in both. But I am not sure if I am understanding your question
 completely. Could you elaborate on the parameters/body part? Is there no
 need for encoding of binary data inside the body? Or do you mean it is
 treated as a string? Or is it just a bytestream and other parameters are
 seen as string?

 On a general note: my main concern was to send many ids fast, if we use
 ints (32bit), in one MB one can fit ~250K; with a bitset, 33 times more
 (somebody check the numbers please :)). But certainly, if the bitset is sparse
 or the collection of ids is just 'a few thousand', a stream of ints/longs will
 be smaller and better to use.

 roman



 On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

  Hello Roman,
 
  Don't you consider to pass long id sequence as body and access internally
  in solr as a content stream? It makes base64 compression not necessary.
  AFAIK url length is limited somehow, anyway.
 
 
  On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   Wrong link to the parser, should be:
  
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
  
  
   On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
Hello @,
   
This thread 'kicked' me into finishing some long-past task of
sending/receiving a large boolean (bitset) filter. We have been using
   bitsets
with solr before, but now I sat down and wrote it as a qparser. The
 use
cases, as you have discussed are:
   
  - necessity to send a long list of ids as a query (where it is not
possible to do it the 'normal' way)
 - or filtering ACLs
   
   
It works in the following way:
   
  - external application constructs bitset and sends it as a query to
   solr
(q or fq, depends on your needs)
  - solr unpacks the bitset (translated bits into lucene ids, if
necessary), and wraps this into a query which then has the easy job
 of
'filtering' wanted/unwanted items
   
Therefore it is good only if you can search against something that is
indexed as integer (id's often are).
   
A simple benchmark shows acceptable performance, to send the bitset
(randomly populated, 10M, with 4M bits set), it takes 110ms
 (25+64+20)
   
To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
(5+14+68ms)
   
But I haven't tested latency of sending it over the network and the
  query
 performance, but since the query is very similar to MatchAllDocs, it is
probably very fast (and I know that sending many Mbs to Solr is fast
 as
well)
   
 I know this is not exactly a 'standard' solution, and it is probably not
something you want to see with hundreds of millions of docs, but
 people
seem to be doing 'not the right thing' all the time;)
So if you think this is something useful for the community, please
 let
  me
know. If somebody would be willing to test it, i can file a JIRA
  ticket.
   
Thanks!
   
Roman
   
   
The code, if no JIRA is needed, can be found here:
   
   
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
   
   
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
   
839ms.  run
154ms.  Building random bitset indexSize=1000 fill=0.5 --
Size=15054208,cardinality=3934477 highestBit=999
 25ms.  Converting bitset to byte array -- resulting array
  length=125
20ms.  Encoding byte array into base64 -- resulting array
  length=168
ratio=1.344
 62ms.  Compressing byte array with GZIP -- resulting array
length=1218602 ratio=0.9748816
20ms.  Encoding gzipped byte array into base64 -- resulting string
length=1624804 ratio=1.2998432
 5ms.  Decoding gzipped byte array from base64
14ms.  Uncompressing decoded byte array
68ms.  Converting from byte array to bitset
 743ms.  running
   
   
On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
   
Not necessarily. If the auth tokens are available on some
other system (DB, LDAP, whatever), one could get them
in the PostFilter and cache them somewhere since,
presumably, they wouldn't be changing all that often. Or
use 
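
(For the curious, the encode path Roman timed above - bitset to bytes, gzip, base64 - corresponds to something like the following sketch, with java.util.BitSet standing in for the Lucene bitset actually used:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.util.BitSet;
  import java.util.zip.GZIPOutputStream;
  import javax.xml.bind.DatatypeConverter;

  public class BitSetEncoder {
    /** BitSet -> raw bytes -> gzip -> base64 string, ready for a POST body. */
    public static String encode(BitSet bits) throws IOException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      GZIPOutputStream gzip = new GZIPOutputStream(buf);
      gzip.write(bits.toByteArray()); // Java 7+: little-endian byte image of the set
      gzip.close();                   // flushes the gzip trailer
      return DatatypeConverter.printBase64Binary(buf.toByteArray());
    }
  }

As the benchmark numbers show, gzip barely helps on a randomly populated set (ratio ~0.97), so for dense random bitsets the compression step could just be skipped.)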

Re: DIH: HTMLStripTransformer in sub-entities?

2013-07-02 Thread Gora Mohanty
On 2 July 2013 20:55, Andy Pickler andy.pick...@gmail.com wrote:
 Thanks for the quick reply.  Unfortunately, I don't believe my company
 would want me sharing our exact production schema in a public forum,
 although I realize it makes it harder to diagnose the problem.  The
 sub-entity is a multi-valued field that indeed does have a relationship to
 the outer entity.  I just left off the 'where' clause from the sub-entity,
 as I didn't believe it was helpful in the context of this problem.  We use
 the convention of..

 SELECT dbColumnName AS solrFieldName

 ...so that we can relate the database column name to what we want it to be
 named in the Solr index.

 I don't think any of this helps you identify my problem, but I tried to
 address your questions.

Um, with all due respect, I do not then know how to
address your issues in a public forum.

Maybe you are then better off hiring someone to handle
your specific problems, after signing a NDA or whatever
it takes from your side: Please see http://wiki.apache.org/solr/Support

Regards,
Gora
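
(For anyone else hitting this, the general shape of a sub-entity carrying HTMLStripTransformer is sketched below; the table, column, and field names are invented for illustration:

  <entity name="post" query="SELECT id, title FROM posts">
    <field column="title" name="title"/>
    <entity name="comment" transformer="HTMLStripTransformer"
            query="SELECT body AS commentText FROM comments WHERE post_id = '${post.id}'">
      <field column="commentText" name="commentText" stripHTML="true"/>
    </entity>
  </entity>

The transformer attribute goes on the sub-entity itself, and stripHTML="true" on each field that should be cleaned.)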


Re: Converting nested data model to solr schema

2013-07-02 Thread adfel70
My current solution is overriding the  out-of-the-box shard routing, and
forcing each document and its attachment to go into a specific shard. But
this is so I can support the query time joins (because joins are only
performed between documents in the same shard).

I'm a bit concerned by this approach only because it forces me to override
out-of-the-box solr behavior.
I didn't implement the whole thing yet, so can't say anything about
performance.

You're saying that your block-join solution does the same thing at index
time (putting a document and its attachments in the same shard), but at query
time it doesn't require performing an explicit join?
If you could add an example of what you'll index, and how you'll query, it
would be very helpful.

Also, if this ticket is going to get into one of the next releases, and it
solves the join problem, it seems it's worth waiting for.
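
(If rebuilding the index is an option, the stock compositeId routing in Solr 4.1+ - the subject of the searchhub document-routing post linked below - gives the same co-location without custom code: documents whose ids share the prefix before '!' land on the same shard. A sketch with invented ids, using the SolrJ 4.x cloud client, exception handling omitted:

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  CloudSolrServer server = new CloudSolrServer("zkhost:2181");
  server.setDefaultCollection("collection1");

  SolrInputDocument parent = new SolrInputDocument();
  parent.addField("id", "doc42!main");          // parent document

  SolrInputDocument attachment = new SolrInputDocument();
  attachment.addField("id", "doc42!att1");      // same "doc42" prefix => same shard

  server.add(parent);
  server.add(attachment);
  server.commit();
)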



Mikhail Khludnev wrote
 during indexing the whole block (doc and its attachments) goes into a particular
 shard, then it can be queried per shard and the results are merged.
 
 btw, do you feel any problem with your current approach - query time joins
 and out-of-the-box shard routing?
 
 
  On Tue, Jul 2, 2013 at 5:19 PM, adfel70 <adfel70@...> wrote:
 
 I'm not familiar with block join in lucene. I've read a bit, and I just
 want
 to make sure - do you think that when this ticket is released, it will
 solve
 the current problem of solr cloud joins?

 Also, can you elaborate a bit about your solution?


 Jack Krupansky-2 wrote
  It sounds like 4.4 will have an RC next week, so the prospects for
 block
  join in 4.4 are kind of dim. I mean, such a significant feature should
  have
  more than a few days to bake before getting released. But... who knows
  what
  Yonik has planned!
 
  -- Jack Krupansky
 
  -Original Message-
  From: adfel70
  Sent: Tuesday, July 02, 2013 7:41 AM
   To: solr-user@lucene.apache.org

  Subject: Re: Converting nested data model to solr schema
 
  As you see it, does SOLR-3076 fixes my problem?
 
  Is SOLR-3076 fix getting into solr 4.4?
 
 
  Mikhail Khludnev wrote
  On Mon, Jul 1, 2013 at 5:56 PM, adfel70 lt;
 
  adfel70@
 
  gt; wrote:
 
  This requires me to override the solr document distribution
 mechanism.
  I fear that with this solution I may lose some of solr cloud's
  capabilities.
 
 
  It's not clear whether you aware of
  http://searchhub.org/2013/06/13/solr-cloud-document-routing/ , but
 what
  you
  did doesn't sound scary to me. If it works, it should be fine. I'm not
  aware of any capabilities that you are going to lose.
  Obviously SOLR-3076 provides astonishing query time performance, with
  offloading actual join work into index time. Check it if you current
  approach turns slow.
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
   <http://www.griddynamics.com>
   <mkhludnev@...>
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074668.html
  Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074696.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 
 
 -- 
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
  <http://www.griddynamics.com>
  <mkhludnev@...>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351p4074876.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr cloud date based partitioning

2013-07-02 Thread Gora Mohanty
On 2 July 2013 22:35, kowish.adamosh kowish.adam...@gmail.com wrote:
 Thanks!

 I have very limited response time (max 100ms) therefore sharding is a must.

Really? Sharding is a must without any measurements to
validate that assertion? I am not sure what advice to give
you if you seem determined to ignore any, but as a touch
point, in the days of Solr 1.4 (much improved performance
since then), out of the box we used to get an average time
of well under 100ms for queries with > 50 simultaneous
users on an index with *everything* stored, and an index
size of > 80 GB. This was admittedly non-scientific, as the
cache enters significantly into the equation, but I will urge
you again: Try measuring things before adding bells and
whistles.

Regards,
Gora


Access to Solr Wiki

2013-07-02 Thread Gora Mohanty
Hi,

May I please be added to the list of editors to the
Solr Wiki as I see that some earlier changes seem
to have gone missing. My user name is GoraMohanty
Thanks.

Regards,
Gora


How to query Solr for empty field or specific value

2013-07-02 Thread Van Tassell, Kristian
Hello,

I'm using Solr 4.2 and am trying to get a specific value (blue) or null field 
(no color) returned by my filter query. My results should yield 3 documents (If 
I execute the two separate filters in different queries, I get 2 hits for one 
query and 1 for the other).

I've tried this (blue or no color set):

select?q=*:*&fq=(-color:[* TO *] OR color:blue)

When that returned zero hits, I added a new field called color.not_null and 
am setting it only if a color is defined (thinking there was a problem with 
using the same field name).

select?q=*:*&fq=(-color.not_null:[* TO *] OR color:blue)

That too yielded zero results. Again, executing them separately does return 
hits (3).

Does anyone see what I might be doing wrong? Thanks in advance,
Kristian


RE: Newbie SolR - Need advice

2013-07-02 Thread fabio1605

So, you keep your mssql database, you just don't use it for searches -
that'll relieve some of the load. Searches then all go through SOLR  its
Lucene indexes. If your various tables need SQL joins, you specify those in
the DataImportHandler (DIH) config. That way, when SOLR indexes everything,
it indexes the data the way you want to see it. 

-- SO  by this you mean we keep mssql as we do!!

But we use the website to run through SOLR. SOLR will then handle the
indexing and retrieval of data from its own indexes, and will make its own
calls to our MSSQL server when required (i.e. updating/adding to indexes...)

Am I on the right track there now?

So MSSQL becomes the datastore
SOLR becomes the search engine...
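
(For the record, the DIH joins mentioned in the earlier reply are expressed as nested entities in data-config.xml; a rough sketch with invented table and column names:

  <dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=shop"
                user="solr" password="..."/>
    <document>
      <entity name="product" query="SELECT id, name FROM products">
        <field column="name" name="name"/>
        <entity name="category"
                query="SELECT label FROM categories WHERE product_id = '${product.id}'">
          <field column="label" name="category"/>
        </entity>
      </entity>
    </document>
  </dataConfig>
)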





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074889.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: copyField and storage requirements

2013-07-02 Thread Ali, Saqib
Thanks Shawn.

Here is the text_general type definition. We would like to bring the
storage requirement down to a minimum for those 500KB content documents. We
just need basic full-text search.

Thanks!!! :)




<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



On Tue, Jul 2, 2013 at 11:35 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/2/2013 12:22 PM, Ali, Saqib wrote:
  Newbie question:
 
  We have the following fields defined in the schema:
 
  <field name="content" type="text_general" indexed="true" stored="false"/>
  <field name="teaser" type="text_general" indexed="false" stored="true"/>
  <copyField source="content" dest="teaser" maxChars="80"/>
 
  the content field is about 500KB of data.
 
  My question is whether Solr stores the entire contents of the that 500KB
  content field?
 
  We want to minimize the stored data in the Solr index, that is why we
 added
  the copyField teaser.

 With that config, the entire 500KB will not be _stored_ .. but it will
 affect the index size because you are indexing it.  Exactly what degree
 that will be depends on the definition of the text_general type.

 Thanks,
 Shawn
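
(A related knob for trimming index size on large indexed-only fields: if length normalization and index-time boosts aren't needed for scoring, omitNorms="true" saves a byte per document per field, and omitTermFreqAndPositions="true" drops position data at the cost of phrase queries. A sketch, not a recommendation for every schema:

  <field name="content" type="text_general" indexed="true" stored="false"
         omitNorms="true"/>
)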




Re: How to query Solr for empty field or specific value

2013-07-02 Thread Jack Krupansky
Better to define color.not_null as a boolean field and always initialize as 
either true or false.


But, even without that, you need to write a pure negative query or clause as

   (*:* -term)

since a purely negative clause has nothing to subtract from and matches no documents. So:

   select?q=*:*&fq=((*:* -color:[* TO *]) OR color:blue)

and

   select?q=*:*&fq=((*:* -color.not_null:[* TO *]) OR color:blue)

-- Jack Krupansky

-Original Message- 
From: Van Tassell, Kristian

Sent: Tuesday, July 02, 2013 3:47 PM
To: solr-user@lucene.apache.org
Subject: How to query Solr for empty field or specific value

Hello,

I'm using Solr 4.2 and am trying to get a specific value (blue) or null 
field (no color) returned by my filter query. My results should yield 3 
documents (If I execute the two separate filters in different queries, I get 
2 hits for one query and 1 for the other).


I've tried this (blue or no color set):

select?q=*:*&fq=(-color:[* TO *] OR color:blue)

When that returned zero hits, I added a new field called color.not_null 
and am setting it only if a color is defined (thinking there was a problem 
with using the same field name).


select?q=*:*&fq=(-color.not_null:[* TO *] OR color:blue)

That too yielded zero results. Again, executing them separately does return 
hits (3).


Does anyone see what I might be doing wrong? Thanks in advance,
Kristian 



Re: Solr cloud date based partitioning

2013-07-02 Thread kowish.adamosh
Sure, I'll measure results and come back if they turn out unsatisfactory.
Thanks very much for the advice.

Out of curiosity: is there any way to partition shards (logical and
physical) by a specified value of a specified field?

Kowish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-cloud-date-based-paritioning-tp4074729p4074899.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to show just the parent domains from results in Solr

2013-07-02 Thread Jack Krupansky
Re-index your data with a separate field for domain name, then either 
manually populate it or use an update processor to extract the domain name 
and store it in the desired field. You can then group by that field.


The URL Classify update processor can do the trick.

Or maybe a custom script with the Stateless Script update processor.

My book has examples for URL Classify.

-- Jack Krupansky
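
(A rough sketch of such an update chain in solrconfig.xml; the URLClassifyProcessorFactory parameter names below are from memory, so verify them against the class's documentation before relying on them:

  <updateRequestProcessorChain name="urlclassify">
    <processor class="solr.URLClassifyProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="urlFieldname">url</str>
      <str name="domainFieldname">domain</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Once a domain field is populated, grouping collapses the results per site:

  select?q=abc&group=true&group.field=domain&group.limit=1
)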

-Original Message- 
From: A Geek

Sent: Tuesday, July 02, 2013 1:47 PM
To: solr user
Subject: How to show just the parent domains from results in Solr

hi All, I've indexed documents in my Solr 4.0 index, with fields like URL, 
page_content etc. Now when I run a search query against the page_content I 
get a lot of urls. And say I have in total 15 URL domains, and under these 
15 domains I've all the pages indexed in SOLR. Is there a way in which I 
can just get the parent URLs for search results instead of getting all the 
urls?

For example, say searching for "abc" returns:
www.aa.com/11.html www.aa.com/12.html www.aa.com/13.html 
www.bb.com/15.html www.bb.com/18.html

I want the results to be like this: www.aa.com www.bb.com
Is there a way in SOLR through which I can achieve this? I've tried 
FieldCollapsing [ https://wiki.apache.org/solr/FieldCollapsing ] but either 
it's not the right solution or I'm not able to use it properly. Could someone 
help me find the solution to the above problem. Thanks in advance.

Regards, KK





RE: How to query Solr for empty field or specific value

2013-07-02 Thread Van Tassell, Kristian
Thank you!

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Tuesday, July 02, 2013 3:05 PM
To: solr-user@lucene.apache.org
Subject: Re: How to query Solr for empty field or specific value

Better to define color.not_null as a boolean field and always initialize as 
either true or false.

But, even without that, you need to write a pure negative query or clause as

(*:* -term)

So:

select?q=*:*&fq=((*:* -color:[* TO *]) OR color:blue)

and

select?q=*:*&fq=((*:* -color.not_null:[* TO *]) OR color:blue)

-- Jack Krupansky

-Original Message-
From: Van Tassell, Kristian
Sent: Tuesday, July 02, 2013 3:47 PM
To: solr-user@lucene.apache.org
Subject: How to query Solr for empty field or specific value

Hello,

I'm using Solr 4.2 and am trying to get a specific value (blue) or null field 
(no color) returned by my filter query. My results should yield 3 documents (If 
I execute the two separate filters in different queries, I get
2 hits for one query and 1 for the other).

I've tried this (blue or no color set):

select?q=*:*&fq=(-color:[* TO *] OR color:blue)

When that returned zero hits, I added a new field called color.not_null 
and am setting it only if a color is defined (thinking there was a problem with 
using the same field name).

select?q=*:*&fq=(-color.not_null:[* TO *] OR color:blue)

That too yielded zero results. Again, executing them separately does return 
hits (3).

Does anyone see what I might be doing wrong? Thanks in advance, Kristian 



What are the options for obtaining IDF at interactive speeds?

2013-07-02 Thread Kathryn Mazaitis
Hi,

I'm using SOLRJ to run a query, with the goal of obtaining:

(1) the retrieved documents,
(2) the TF of each term in each document,
(3) the IDF of each term in the set of retrieved documents (TF/IDF would be
fine too)

...all at interactive speeds, or < 10s per query. This is a demo, so if all
else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

(1) and (2) are working; I completed the patch posted in the following
issue:
https://issues.apache.org/jira/browse/SOLR-949
and am just setting tv=true&tv.tf=true for my query. This way I get the
documents and the tf information all in one go.

With (3) I'm running into trouble. I have found 2 ways to do it so far:

Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf
information along with the documents and tf information. Since each term
may appear in multiple documents, this means retrieving idf information for
each term about 20 times, and takes over a minute to do.

Option B: After I've gathered the tf information, run through the list of
terms used across the set of retrieved documents, and for each term, run a
query like:
{!func}idf(text,'the_term')&defType=func&fl=score&rows=1
...while this retrieves idf information only once for each term, the added
latency for doing that many queries piles up to almost two minutes on my
current corpus.

Is there anything I didn't think of -- a way to construct a query to get
idf information for a set of terms all in one go, outside the bounds of
what terms happen to be in a document?

Failing that, does anyone have a sense for how far I'd have to scale down a
corpus to approach interactive speeds, if I want this sort of data?

Katie
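
(One avenue that may be worth trying before shrinking the corpus: in Solr 4.0+, function queries can be aliased as pseudo-fields in fl, so several idf() values can ride along on a single request. Untested at scale, and the term names below are placeholders:

  select?q=*:*&rows=1&fl=a:idf(text,'term_a'),b:idf(text,'term_b'),c:idf(text,'term_c')

Since idf is a per-term, per-field statistic rather than per-document, one row is enough; batching every term from the first response into one such query would turn N follow-up requests into one.)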


Re: Two instances of solr - the same datadir?

2013-07-02 Thread Peter Sturge
The RO instance commit isn't (or shouldn't be) doing any real writing, just
an empty commit to force new searchers, autowarm/refresh caches etc.
Admittedly, we do all this on 3.6, so 4.0 could have different behaviour in
this area.
As long as you don't have autocommit in solrconfig.xml, there wouldn't be
any commits 'behind the scenes' (we do all our commits via a local solrj
client so it can be fully managed).
The only caveat might be NRT/soft commits, but I'm not too familiar with
this in 4.0.
In any case, your RO instance must be getting updated somehow, otherwise
how would it know your write instance made any changes?
Perhaps your write instance notifies the RO instance externally from Solr?
(a perfectly valid approach, and one that would allow a 'single' lock to
work without contention)
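
(For anyone following along, the lock type under discussion is set per core in the indexConfig section of solrconfig.xml - mainIndex in older 3.x configs - e.g.:

  <indexConfig>
    <lockType>${solr.lock.type:native}</lockType>
  </indexConfig>

Driving it from a system property like this lets the writer and the read-only instance share one config file while using different lock types.)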



On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Interesting, we are running 4.0 - and solr will refuse to start (or
 reload) the core. But from looking at the code I am not seeing it is doing
 any writing - but I should dig more...

 Are you sure it needs to do writing? Because I am not calling commits, in
 fact I have deactivated *all* components that write into index, so unless
 there is something deep inside, which automatically calls the commit, it
 should never happen.

 roman


 On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  Hmmm, single lock sounds dangerous. It probably works ok because you've
  been [un]lucky.
  For example, even with a RO instance, you still need to do a commit in
  order to reload caches/changes from the other instance.
  What happens if this commit gets called in the middle of the other
  instance's commit? I've not tested this scenario, but it's very possible
  with a 'single' lock the results are indeterminate.
  If the 'single' lock mechanism is making assumptions e.g. no other
 process
  will interfere, and then one does, the Lucene index could very well get
  corrupted.
 
  For the error you're seeing using 'native', we use native lockType for
 both
  write and RO instances, and it works fine - no contention.
  Which version of Solr are you using? Perhaps there's been a change in
  behaviour?
 
  Peter
 
 
  On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   as i discovered, it is not good to use 'native' locktype in this
  scenario,
   actually there is a note in the solrconfig.xml which says the same
  
   when a core is reloaded and solr tries to grab lock, it will fail -
 even
  if
   the instance is configured to be read-only, so i am using 'single' lock
  for
   the readers and 'native' for the writer, which seems to work OK
  
   roman
  
  
   On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
 I have auto commit after 40k recs/1800 secs. I only tested with manual
 commit, but I don't see why it should work differently.
Roman
On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com
 wrote:
   
If it makes you feel better, I also considered this approach when I
  was
   in
the same situation with a separate indexer and searcher on one
  Physical
linux machine.
   
My main concern was re-using the FS cache between both instances -
  If
   I
replicated to myself there would be two independent copies of the
  index,
FS-cached separately.
   
I like the suggestion of using autoCommit to reload the index. If
 I'm
reading that right, you'd set an autoCommit on 'zero docs changing',
  or
just 'every N seconds'? Did that work?
   
Best of luck!
   
Tim
   
   
On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:
   
 So here it is for a record how I am solving it right now:

 Write-master is started with: -Dmontysolr.warming.enabled=false
 -Dmontysolr.write.master=true -Dmontysolr.read.master=http://localhost:5005
 Read-master is started with: -Dmontysolr.warming.enabled=true
 -Dmontysolr.write.master=false


 solrconfig.xml changes:

 1. all index changing components have this bit,
 enable=${montysolr.master:true} - ie.

 <updateHandler class="solr.DirectUpdateHandler2"
   enable="${montysolr.master:true}">

 2. for cache warming de/activation

 <listener event="newSearcher"
   class="solr.QuerySenderListener"
   enable="${montysolr.enable.warming:true}">...

 3. to trigger refresh of the read-only-master (from write-master):

 <listener event="postCommit"
   class="solr.RunExecutableListener"
   enable="${montysolr.master:true}">
   <str name="exe">curl</str>
   <str name="dir">.</str>
   <bool name="wait">false</bool>
   <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
 </listener>

 This works, I still don't like the reload of the 

Partial Matching in both query and field

2013-07-02 Thread James Bathgate
Given a string of "123456" and a search query "923459", what should the
schema look like to consider this a match, because at least 4 consecutive
characters in the query match 4 consecutive characters in the data? I'm trying
an NGramFilterFactory on the index and NGramTokenizerFactory on the query
in the schema, but that's not working.

I believe the problem is 'field:923459' is parsed as the phrase 'field:"9234
2345 3459"' instead of 'field:9234 field:2345 field:3459'.


James Bathgate | Sr. Developer

Toll Free (888) 643-9043 x610 - Fax (719) 358-2027

4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918
www.searchspring.net <http://www.searchspring.net>


Re: Newbie SolR - Need advice

2013-07-02 Thread Sandeep Mestry
Hi Fabio,

Yes, you're on right track.

I'd like to now direct you to first reply from Jack to go through solr
tutorial.
Even with Solr,, it will take some time to learn various bits and pieces
about designing fields, their field types, server configuration, etc. and
then tune the results to match the results that you're currently getting
from the database. There is lots of info available for Solr on web and do
check Lucidworks' Solr Reference Guide.
http://docs.lucidworks.com/display/solr/Apache+Solr+Reference+Guide;jsessionid=16ED0DB3B6F6BE8CEC6E6CDB207DBC49

Best of Solr Luck!

Sandeep



On 2 July 2013 20:47, fabio1605 fabio.to...@btinternet.com wrote:


 So, you keep your mssql database, you just don't use it for searches -
 that'll relieve some of the load. Searches then all go through SOLR  its
 Lucene indexes. If your various tables need SQL joins, you specify those in
 the DataImportHandler (DIH) config. That way, when SOLR indexes everything,
 it indexes the data the way you want to see it.

 -- SO  by this you mean we keep mssql as we do!!

 But we use the website to run through SOLR SOLR will then handle the
 indexing and retrieval of data from its own index's, and will make its own
 calls to our MSSQL server when required(i.e updating/adding to
 indexs..)

 Am I on the right tracks there now!

 So MSSQL becomes the datastore
 SOLR becomes the search engine...





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074889.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Partial Matching in both query and field

2013-07-02 Thread Jack Krupansky
You will need to set q.op to OR, and... use a field type that has the 
autoGeneratePhraseQueries attribute set to false.


-- Jack Krupansky

-Original Message- 
From: James Bathgate

Sent: Tuesday, July 02, 2013 5:10 PM
To: solr-user@lucene.apache.org
Subject: Partial Matching in both query and field

Given a string of 123456 and a search query 923459, what should the
schema look like to consider this a match because at least 4 consecutive in
characters the query match 4 consecutive characters in the data? I'm trying
an NGramFilterFactory on the index and NGramTokenizerFactory on the query
in the schema, but that's not working.

I believe the problem is 'field:923459' is parsed as 'field:9234 2345
3459' instead of 'field:9234 field:2345 field:3459'.


James Bathgate | Sr. Developer

Toll Free (888) 643-9043 x610 - Fax (719) 358-2027

4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918
www.searchspring.net <http://www.searchspring.net>



Re: Partial Matching in both query and field

2013-07-02 Thread James Bathgate
Jack,

I've already tried that, here's my query:

<str name="debugQuery">on</str>
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">0_extrafield1_n:20454</str>
<str name="q.op">OR</str>
<str name="rows">10</str>
<str name="version">2.2</str>

Here's the parsed query:

<str name="parsedquery_toString">0_extrafield1_n:"2o45 o454 2o454"</str>

Here's the applicable lines from schema.xml:

<fieldType name="ngram" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="1" splitOnCaseChange="0"
        splitOnNumerics="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0"
        replacement="o" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1|l"
        replacement="i" replace="all"/>
    <filter class="solr.NGramFilterFactory" minGramSize="4"
        maxGramSize="16"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4"
        maxGramSize="16"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="[^A-Za-z0-9]+" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0"
        replacement="o" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1|l"
        replacement="i" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<dynamicField name="*_n" type="ngram" indexed="true" stored="true"
    autoGeneratePhraseQueries="false"/>


James



James Bathgate | Sr. Developer

Toll Free (888) 643-9043 x610 - Fax (719) 358-2027

4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918
www.searchspring.net <http://www.searchspring.net>


On Tue, Jul 2, 2013 at 2:22 PM, Jack Krupansky <j...@basetechnology.com> wrote:

 You will need to set q.op to OR, and... use a field type that has the
 autoGeneratePhraseQueries attribute set to false.

 -- Jack Krupansky

 -Original Message- From: James Bathgate
 Sent: Tuesday, July 02, 2013 5:10 PM
 To: solr-user@lucene.apache.org
 Subject: Partial Matching in both query and field


 Given a string of 123456 and a search query 923459, what should the
 schema look like to consider this a match because at least 4 consecutive in
 characters the query match 4 consecutive characters in the data? I'm trying
 an NGramFilterFactory on the index and NGramTokenizerFactory on the query
 in the schema, but that's not working.

 I believe the problem is 'field:923459' is parsed as 'field:9234 2345
 3459' instead of 'field:9234 field:2345 field:3459'.


 James Bathgate | Sr. Developer

 Toll Free (888) 643-9043 x610 - Fax (719) 358-2027

 4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918
 www.searchspring.net <http://www.searchspring.net>



Re: Two instances of solr - the same datadir?

2013-07-02 Thread Michael Della Bitta
Wouldn't it be better to do a RELOAD?

http://wiki.apache.org/solr/CoreAdmin#RELOAD

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Tue, Jul 2, 2013 at 5:05 PM, Peter Sturge peter.stu...@gmail.com wrote:

 The RO instance commit isn't (or shouldn't be) doing any real writing, just
 an empty commit to force new searchers, autowarm/refresh caches etc.
 Admittedly, we do all this on 3.6, so 4.0 could have different behaviour in
 this area.
 As long as you don't have autocommit in solrconfig.xml, there wouldn't be
 any commits 'behind the scenes' (we do all our commits via a local solrj
 client so it can be fully managed).
 The only caveat might be NRT/soft commits, but I'm not too familiar with
 this in 4.0.
 In any case, your RO instance must be getting updated somehow, otherwise
 how would it know your write instance made any changes?
 Perhaps your write instance notifies the RO instance externally from Solr?
 (a perfectly valid approach, and one that would allow a 'single' lock to
 work without contention)



 On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Interesting, we are running 4.0 - and solr will refuse the start (or
  reload) the core. But from looking at the code I am not seeing it is
 doing
  any writing - but I should digg more...
 
  Are you sure it needs to do writing? Because I am not calling commits, in
  fact I have deactivated *all* components that write into index, so unless
  there is something deep inside, which automatically calls the commit, it
  should never happen.
 
  roman
 
 
  On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com
  wrote:
 
   Hmmm, single lock sounds dangerous. It probably works ok because you've
   been [un]lucky.
   For example, even with a RO instance, you still need to do a commit in
   order to reload caches/changes from the other instance.
   What happens if this commit gets called in the middle of the other
   instance's commit? I've not tested this scenario, but it's very
 possible
   with a 'single' lock the results are indeterminate.
   If the 'single' lock mechanism is making assumptions e.g. no other
  process
   will interfere, and then one does, the Lucene index could very well get
   corrupted.
  
   For the error you're seeing using 'native', we use native lockType for
  both
   write and RO instances, and it works fine - no contention.
   Which version of Solr are you using? Perhaps there's been a change in
   behaviour?
  
   Peter
  
  
   On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
as i discovered, it is not good to use 'native' locktype in this
   scenario,
actually there is a note in the solrconfig.xml which says the same
   
when a core is reloaded and solr tries to grab lock, it will fail -
  even
   if
the instance is configured to be read-only, so i am using 'single'
 lock
   for
the readers and 'native' for the writer, which seems to work OK
   
roman
   
   
On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
   
 I have auto commit after 40k RECs/1800secs. But I only tested with
   manual
 commit, but I don't see why it should work differently.
 Roman
 On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com
  wrote:

 If it makes you feel better, I also considered this approach when
 I
   was
in
 the same situation with a separate indexer and searcher on one
   Physical
 linux machine.

 My main concern was re-using the FS cache between both
 instances -
   If
I
 replicated to myself there would be two independent copies of the
   index,
 FS-cached separately.

 I like the suggestion of using autoCommit to reload the index. If
  I'm
 reading that right, you'd set an autoCommit on 'zero docs
 changing',
   or
 just 'every N seconds'? Did that work?

 Best of luck!

 Tim


 On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:

  So here it is for a record how I am solving it right now:
 
  Write-master is started with: -Dmontysolr.warming.enabled=false
  -Dmontysolr.write.master=true -Dmontysolr.read.master=
  http://localhost:5005
  Read-master is started with: -Dmontysolr.warming.enabled=true
  -Dmontysolr.write.master=false
 
 
  solrconfig.xml changes:
 
  1. all index changing components have this bit,
  enable=${montysolr.master:true} - ie.
 
  <updateHandler class="solr.DirectUpdateHandler2"
    enable="${montysolr.master:true}">
 
  2. for cache warming de/activation
 
  <listener event="newSearcher"
    class="solr.QuerySenderListener"
   

Re: Partial Matching in both query and field

2013-07-02 Thread Jack Krupansky
Ahhh... you put autoGeneratePhraseQueries="false" on the field - but it 
needs to be on the field type.


You can see from the parsed query that it generated the phrase.

-- Jack Krupansky
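
(In other words, the attribute belongs on the fieldType, where it is honored, rather than on the dynamicField; sketching just the changed opening tag:

  <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100"
      autoGeneratePhraseQueries="false">

and the attribute can then be dropped from the dynamicField declaration.)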

-Original Message- 
From: James Bathgate

Sent: Tuesday, July 02, 2013 5:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Partial Matching in both query and field

Jack,

I've already tried that, here's my query:

<str name="debugQuery">on</str>
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">0_extrafield1_n:20454</str>
<str name="q.op">OR</str>
<str name="rows">10</str>
<str name="version">2.2</str>

Here's the parsed query:

<str name="parsedquery_toString">0_extrafield1_n:"2o45 o454 2o454"</str>

Here's the applicable lines from schema.xml:

<fieldType name="ngram" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="1" splitOnCaseChange="0"
        splitOnNumerics="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0"
        replacement="o" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1|l"
        replacement="i" replace="all"/>
    <filter class="solr.NGramFilterFactory" minGramSize="4"
        maxGramSize="16"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4"
        maxGramSize="16"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="[^A-Za-z0-9]+" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0"
        replacement="o" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1|l"
        replacement="i" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<dynamicField name="*_n" type="ngram" indexed="true" stored="true"
    autoGeneratePhraseQueries="false"/>


James



James Bathgate | Sr. Developer

Toll Free (888) 643-9043 x610 - Fax (719) 358-2027

4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918
www.searchspring.net <http://www.searchspring.net>


On Tue, Jul 2, 2013 at 2:22 PM, Jack Krupansky <j...@basetechnology.com> wrote:



You will need to set q.op to OR, and... use a field type that has the
autoGeneratePhraseQueries attribute set to false.

-- Jack Krupansky

-Original Message- From: James Bathgate
Sent: Tuesday, July 02, 2013 5:10 PM
To: solr-user@lucene.apache.org
Subject: Partial Matching in both query and field


Given a string of 123456 and a search query 923459, what should the
schema look like to consider this a match because at least 4 consecutive 
in
characters the query match 4 consecutive characters in the data? I'm 
trying

an NGramFilterFactory on the index and NGramTokenizerFactory on the query
in the schema, but that's not working.

I believe the problem is 'field:923459' is parsed as 'field:9234 2345
3459' instead of 'field:9234 field:2345 field:3459'.


James Bathgate | Sr. Developer

Toll Free (888) 643-9043 x610 - Fax (719) 358-2027

4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918
www.searchspring.net <http://www.searchspring.net>





Re: Access to Solr Wiki

2013-07-02 Thread Steve Rowe
I've added GoraMohanty to the Solr wiki's ContributorsGroup page. - Steve

On Jul 2, 2013, at 3:25 PM, Gora Mohanty g...@mimirtech.com wrote:

 Hi,
 
 May I please be added to the list of editors to the
 Solr Wiki as I see that some earlier changes seem
 to have gone missing. My user name is GoraMohanty
 Thanks.
 
 Regards,
 Gora


