Group by multiple fields

2013-06-06 Thread Benjamin Ryan
Hi,
   Is it possible to create a query similar in function to multiple 
SQL group by clauses?
   I have documents that have a single valued fields for host name 
and collection name and would like to group the results by both e.g. a result 
would contain a count of the documents grouped by both fields:

   Hostname1 collection1 456
   Hostname1 collection2 567
   Hostname2 collection1 123
   Hostname2 collection2 789

   This is on Solr 3.3 (could be on 4.x) and both fields are single-valued 
with the type:

   <fieldType name="lowerCaseSort" class="solr.TextField" 
sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
   </fieldType>

   <field name="collectionName" type="lowerCaseSort" indexed="true" 
stored="true" multiValued="false" required="true" omitNorms="true" />
   <field name="hostName" type="lowerCaseSort" indexed="true" 
stored="true" multiValued="false" required="true" omitNorms="true" />

Regards,
   Ben

--
Dr Ben Ryan
Jorum Technical Manager

5.12 Roscoe Building
The University of Manchester
Oxford Road
Manchester
M13 9PL
Tel: 0160 275 6039
E-mail: 
benjamin.r...@manchester.ac.uk
--



Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Prathik Puthran
My use case is I want to search for any substring of the indexed string and
the Suggester should suggest the indexed string. What can I do to make this
work?

Thanks,
Prathik


On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Please excuse my misunderstanding, but I always wonder why this index time
 processing is suggested usually. from my POV is the case for query-time
 processing i.e. PrefixQuery aka wildcard query Jason* .
 Ultra-fast term retrieval also provided by TermsComponent.


 On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  ngrams?
 
  See:
  http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
  apache/lucene/analysis/ngram/**NGramFilterFactory.html
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
 
 
  -- Jack Krupansky
 
  -Original Message- From: Prathik Puthran
  Sent: Wednesday, June 05, 2013 11:59 AM
  To: solr-user@lucene.apache.org
  Subject: Configuring lucene to suggest the indexed string for all the
  searches of the substring of the indexed string
 
 
  Hi,
 
  Is it possible to configure solr to suggest the indexed string for all
 the
  searches of the substring of the string?
 
  Thanks,
  Prathik
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Upayavira
Can you use the ShingleFilterFactory? It is ngrams for terms rather than
characters. If you limited it to two-term ngrams, when the user presses
space after their first word, you could do a suggest query against
your two-term ngram field, which would suggest Jason Bourne, Jason
Statham, etc. when they press space after Jason.
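
For illustration only (untested, and the field/type names are placeholders),
a two-term shingle type could look something like this in schema.xml:

  <fieldType name="suggest_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
              outputUnigrams="true"/>
    </analyzer>
  </fieldType>

You would copyField your title/name field into a field of this type and run the
suggest query against that.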

Upayavira

On Thu, Jun 6, 2013, at 07:25 AM, Prathik Puthran wrote:
 My use case is I want to search for any substring of the indexed string
 and
 the Suggester should suggest the indexed string. What can I do to make
 this
 work?
 
 Thanks,
 Prathik
 
 
 On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev
 mkhlud...@griddynamics.com
  wrote:
 
  Please excuse my misunderstanding, but I always wonder why this index time
  processing is suggested usually. from my POV is the case for query-time
  processing i.e. PrefixQuery aka wildcard query Jason* .
  Ultra-fast term retrieval also provided by TermsComponent.
 
 
  On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   ngrams?
  
   See:
   http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
   apache/lucene/analysis/ngram/**NGramFilterFactory.html
  http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
  
  
   -- Jack Krupansky
  
   -Original Message- From: Prathik Puthran
   Sent: Wednesday, June 05, 2013 11:59 AM
   To: solr-user@lucene.apache.org
   Subject: Configuring lucene to suggest the indexed string for all the
   searches of the substring of the indexed string
  
  
   Hi,
  
   Is it possible to configure solr to suggest the indexed string for all
  the
   searches of the substring of the string?
  
   Thanks,
   Prathik
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 


SOLR CSV output in custom order

2013-06-06 Thread anurag.jain
I want the output of the CSV file in a particular order. When I use wt=csv it gives
the output in a random order. Is there any way to get the output in the proper format?

Thanks 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-CSV-output-in-custom-order-tp4068527.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Prathik Puthran
This works even now, i.e. when I search for "Jas" it suggests "Jason
Bourne". What I want is when I search for "Bour" or "ason" (any substring)
it should suggest "Jason Bourne".


On Thu, Jun 6, 2013 at 12:34 PM, Upayavira u...@odoko.co.uk wrote:

 Can you se the ShingleFilterFactory? It is ngrams for terms rather than
 characters. If you limited it to two term ngrams, when the user presses
 space after their first word, you could do a suggested query against
 your two term ngram field, which would suggest Jason Bourne, Jason
 Statham, etc then you press space after Jason.

 Upayavira

 On Thu, Jun 6, 2013, at 07:25 AM, Prathik Puthran wrote:
  My use case is I want to search for any substring of the indexed string
  and
  the Suggester should suggest the indexed string. What can I do to make
  this
  work?
 
  Thanks,
  Prathik
 
 
  On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev
  mkhlud...@griddynamics.com
   wrote:
 
   Please excuse my misunderstanding, but I always wonder why this index
 time
   processing is suggested usually. from my POV is the case for query-time
   processing i.e. PrefixQuery aka wildcard query Jason* .
   Ultra-fast term retrieval also provided by TermsComponent.
  
  
   On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky 
 j...@basetechnology.com
   wrote:
  
ngrams?
   
See:
http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
apache/lucene/analysis/ngram/**NGramFilterFactory.html
  
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
   
   
-- Jack Krupansky
   
-Original Message- From: Prathik Puthran
Sent: Wednesday, June 05, 2013 11:59 AM
To: solr-user@lucene.apache.org
Subject: Configuring lucene to suggest the indexed string for all the
searches of the substring of the indexed string
   
   
Hi,
   
Is it possible to configure solr to suggest the indexed string for
 all
   the
searches of the substring of the string?
   
Thanks,
Prathik
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
mkhlud...@griddynamics.com
  



Re: Solr: separating index and storage

2013-06-06 Thread Sourajit Basak
Absolutely. Solr will return the reference along with the docs/results; those
references may be used to look up the actual stuff. Such use cases aren't
hard to solve.

If the use case demands returning the actual stuff alongside the results,
it becomes non-trivial, especially during high loads.

To avoid this and do a quick implementation I can judiciously create stored
fields and see how it performs. I will need to figure out what happens if
the volume growth of stored fields is high, how much is the disk I/O and
what happens if we shard the index, like, what happens to the stored fields
then.

Best,
Sourajit




On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson erickerick...@gmail.comwrote:

 You have to index something with your Solr documents that
 has meaning in _your_ system so you can find the
 original record. You don't search this field, you just
 return it with the search results and then use it to get
 the original document.

 If you're storing the original in a DB, this can be the PK.
 If on a file system the path. etc.

 Essentially, since the association is specific to your environment
 you need to handle it explicitly...

 Best
 Erick

 On Mon, Jun 3, 2013 at 11:56 AM, Sourajit Basak
 sourajit.ba...@gmail.com wrote:
  Consider the following use case.
 
  Certain words are extracted from a document and indexed. The exact
 sentence
  containing the word cannot be stored alongside the extracted word because
  of the volume at which the documents grow; How can the index and, lets
 call
  it doc servers be separated ?
 
  An option is to store the sentences in MongoDB or a RDBMS. But there
 seems
  to be a schema level design issue. Assuming 'word' to be a multivalued
  field, how do we associate to it a reference to the corresponding entry
 in
  the doc server.
 
  May create (word_1, ref_1) tuples. Is there any other in-built feature ?
 
  Any related project which separates index  doc servers ?
 
  Thanks,
  Sourajit



Filtering on results with more than N words.

2013-06-06 Thread Dotan Cohen
Is there any way to restrict the search results to only those
documents with more than N words / tokens in the searched field? I
thought that this would be an easy one to Google for, but I cannot
figure it out or find any references. There are many references to
word size in characters, but not to field size in words.

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: data-import problem

2013-06-06 Thread Stavros Delisavas
I tried to deactivate the uniqueKey, but that made Solr not work at all. 
I got Error 500 for everything (no admin page, etc.), so I had to 
reactivate it.


This is my current configuration as you recommended. Unfortunately there is still 
no improvement; the second table doesn't get recorded. I included the 
error message from the log file.


http://pastebin.com/0vut38qL

Has no one ever successfully imported two tables into solr before?



Am 06.06.2013 00:01, schrieb bbarani:

A Solr index does not need a unique key, but almost all indexes use one.

http://wiki.apache.org/solr/UniqueKey

Try the below query passing id as id instead of titleid..

<document>
  <entity name="title" query="SELECT id, title FROM name"></entity>
</document>

A proper dataimport config will look like,

<entity name="relationship_entity" query="select id,name,value from table">
    <field column="id" name="idSchemaFieldName"/>
    <field column="name" name="nameSchemaFieldName"/>
    <field column="value" name="valueSchemaFieldName" />
</entity>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068447.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Heap space problem with mlt query

2013-06-06 Thread Varsha Rani
Hi,

As per the suggestions, I changed my config file as follows:
 reduced the document cache size from 31067 to 16384 and
 autowarmCount from 2046 to 1024.

My machine's RAM size is 16GB; about 1GB of RAM was in use when the 85GB index started.

My config file is as follows:

<ramBufferSizeMB>128</ramBufferSizeMB>

<filterCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"
    cleanupThread="true"/>

<queryResultCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"
    cleanupThread="true"/>

<documentCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"
    cleanupThread="true"/>




I am running 20-25 mlt queries per second. With each mlt query the RAM used
increases continuously. When the RAM used reached 6GB, a Java heap space problem
occurred. With every 5 consecutive mlt queries the RAM used increased by 1GB.








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Heap-space-problem-with-mlt-query-tp4068278p4068541.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Heap space problem with mlt query

2013-06-06 Thread Stavros Delisavas
I recently had the same issue, which could be fixed very easily: add the 
property batchSize="-1" to your dataSource tag.
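
For example (just a sketch -- keep whatever driver/url/user settings you
already have), the attribute goes directly on the dataSource element:

  <dataSource name="..." driver="..." url="..." user="..." password="..." batchSize="-1"/>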


Tell me if that helped.

Am 06.06.2013 11:30, schrieb Varsha Rani:

Hi,

As per suggestions , changed  in my config file  as :
  reduced document cache size from 31067 to 16384 and
  autowarmcount from 2046 to 1024.

My machine RAM size is 16GB , 1 GB RAM used as index of 85GB started.

  my config file as :

<ramBufferSizeMB>128</ramBufferSizeMB>

<filterCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"
    cleanupThread="true"/>

<queryResultCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"
    cleanupThread="true"/>

<documentCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="1024"
    cleanupThread="true"/>




I am running 20-25 mlt queries in 1 sec . As with each mlt query RAM used
increases continuously.  As RAM used reached to 6GB, java heap space problem
occur. With each 5 continuous mlt queries RAM used increased by 1GB.








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Heap-space-problem-with-mlt-query-tp4068278p4068541.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr 4.2.1 higher memory footprint vs Solr 3.5

2013-06-06 Thread Bernd Fehling


Am 05.06.2013 18:09, schrieb SandeepM:
 /So we see the jagged edge waveform which keeps climbing (GC cycles don't
 completely collect memory over time).  Our test has a short capture from
 real traffic and we are replaying that via solrmeter./
 
 Any idea why the memory climbs over time.  The GC should cleanup after data
 is shipped back.  Could there be a memory leak in SOLR?

Sorting can be a killer, so take care to sort only on fields prepared for 
sorting and on fields where sorting really makes sense.
The ultimate test/killer is sorting and faceting on doc id (the id field).
Most of the time the problem is the fieldCache:
"Provides introspection of the Lucene FieldCache; this is **NOT** a cache that 
is managed by Solr."

Yes, the memory footprint right after start is higher for 4.2.1 than for 3.5, 
but
during runtime it is much lower with 4.2.1 (because of FST).

As for my system (45 million docs / 125 GB index), with 3.x I had 4 to 5 GB right 
after start
and 10 to 12 GB during runtime after several days. Now with 4.2.1 I have about 
6 GB right
after start and am between 6.5 and 8 GB at runtime.
I don't see any increase in memory usage even after weeks, only peaks to 10 GB 
during replication
which then decrease back to normal (6.5 to 8 GB).
So definitely no memory leak.

What helped me a lot was switching to G1GC.
Faster, smoother, very little ripple, nearly no sawtooth.

Bernd


Solr indexing slows down

2013-06-06 Thread Sebastian Steinfeld
Hi,

I am new to Solr and we want to use Solr to speed up our product search.
It is working really nicely, but I think I have a problem with the indexing:
it slows down after a few minutes.

I am using the DataImportHandler to import the products from the database.
I start the import by executing the following HTTP request:
/dataimport?command=full-import&clean=true&commit=true

I guess these are the important parts of my configuration:

schema.xml:
--
<fields>
   <field name="pk"   type="long"         indexed="true" stored="true"  required="true" />
   <field name="code" type="string"       indexed="true" stored="true"  required="true" />
   <field name="ean"  type="string"       indexed="true" stored="false" />
   <field name="name" type="lowercase"    indexed="true" stored="false" />
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
--

solrconfig.xml:
--
  <requestHandler name="/dataimport" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">dataimport-handler.xml</str>
    </lst>
  </requestHandler>
--

dataimport-handler.xml:
--
<dataConfig>
  <dataSource name="local" driver="*" 
      url="*" 
      user="*" 
      password="*" 
  />
  <document>
    <entity name="product" pk="PRODUCTS_PK" dataSource="local"
        query="SELECT PRODUCTS_PK, PRODUCTS_CODE, PRODUCTS_EAN, PRODUCTSLP_NAME FROM V_SOLR_IMPORT4PRODUCT_SEARCH">
      <field column="PRODUCTS_PK"     name="pk" />
      <field column="PRODUCTS_CODE"   name="code" />
      <field column="PRODUCTS_EAN"    name="ean" />
      <field column="PRODUCTSLP_NAME" name="name" />
    </entity>
  </document>
</dataConfig>
--

The amount of documents I want to index is 8 million; the first 1.6 million are 
indexed in 2 minutes, but completing the import takes nearly 2 hours.
The size of the index on the hard drive is 610MB.
I started the Solr server with 2GB of memory.


I read that the duration of indexing might be connected to the batch size, so I 
increased the batchSize in the dataSource to 10,000, but this didn't make any 
difference.
I also tried to disable the autocommit, which is configured in the 
solrconfig.xml. I disabled it by commenting it out, but this also didn't make any 
difference.

It would be really nice if someone of you could help me with this problem.

Thank you very much,
Sebastian



Re: Heap space problem with mlt query

2013-06-06 Thread Varsha Rani
Hi Stavros,

I checked it with batchSize="-1", but still the same issue.


My single mlt query is:

http://machine_ip:8983/solr/News/mlt?q=field1:34358471&qt=/mlt&mlt.match.include=true&mlt=true&mlt.mindf=1&mlt.mintf=1&mlt.minwl=3&mlt.boost=true&fq=cat:News AND date:[136644000 TO 1362827444000] AND -category:General&mlt.qf=field2^2&mlt.fl=field3&mlt.count=30&start=0&fl=field1,field2,field3,field4,field5,field6,field7,field8,field9,field10&sort=score desc&rows=50&wt=xml&version=2.2

The exception I faced is:


org.apache.solr.client.solrj.SolrServerException: Error executing query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:113)
at org.orkash.automatic.mltdetector.main(mltdetector.java:212)
Caused by: org.apache.solr.common.SolrException: Error in
xpath:/config/luceneMatchVersion for solrconfig.xml 
org.apache.solr.common.SolrException: Error in
xpath:/config/luceneMatchVersion for solrconfig.xml at
org.apache.solr.core.Config.getNode(Config.java:197)at
org.apache.solr.core.Config.getVal(Config.java:202) at
org.apache.solr.core.Config.getLuceneVersion(Config.java:271)   at
org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:70)  at
org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:66)  at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:68) 
at
org.apache.solr.search.QParser.getQuery(QParser.java:143)   at
org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:95)
 
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340) 
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) 
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) 
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) 
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) 
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) 
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) 
at org.mortbay.jetty.Server.handle(Server.java:326) at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) 
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
 
at org.mortbay.jetty.HttpParser.parseNext(HttpPa

Error in xpath:/config/luceneMatchVersion for solrconfig.xml 
org.apache.solr.common.SolrException: Error in
xpath:/config/luceneMatchVersion for solrconfig.xml at
org.apache.solr.core.Config.getNode(Config.java:197)at
org.apache.solr.core.Config.getVal(Config.java:202) at
org.apache.solr.core.Config.getLuceneVersion(Config.java:271)   at
org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:70)  at
org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:66)  at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:68) 
at
org.apache.solr.search.QParser.getQuery(QParser.java:143)   at
org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:95)
 
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340) 
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) 
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) 
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) 
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) 
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) 
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) 
at org.mortbay.jetty.Server.handle(Server.java:326) at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) 
at

Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Robert Krüger
On Wed, Jun 5, 2013 at 9:12 PM, Jack Krupansky j...@basetechnology.com wrote:
 Look in the Solr log - the error message should tell you what the multiple
 values are. For example,

 95484 [qtp2998209-11] ERROR org.apache.solr.core.SolrCore  –
 org.apache.solr.common.SolrException: ERROR: [doc=doc-1] multiple values
 encountered for non multiValued field content_s: [def, abc]

 One of the values should be the value of the field that is the source of the
 copyField. Maybe the other value will give you a clue as to where it came
 from.

 Check your SolrJ code - maybe you actually do try to initialize a value in
 the field that is the copyField target.

I see the values in the stack trace:

org.apache.solr.common.SolrException: ERROR:
[doc=8f60d040-3462-4b28-998f-fd05a64f1cd8:/] multiple values
encountered for non multiValued field name2: [rename, rename]

It is just twice the value of the source field, and I am not referencing
that field in my Java code.


Re: Search across multiple collections

2013-06-06 Thread Erick Erickson
You pretty much need to issue separate
queries against each collection and creatively
combine them. All of Solr's distributed search
stuff pre-supposes two things:
1) the schemas are very similar
2) the types of docs in each collection are also
   very similar.

2) is a bit subtle. If you store different kinds of
docs in different cores, then the statistics for
term frequency etc. will be different. There's some
work being done (I think) to support distributed
tf/idf. But anyway, in this case the scores of the
docs from one collection will tend to dominate the
result set.

Or if you're talking about joining, see Anria's comments.

Best
Erick

On Wed, Jun 5, 2013 at 7:34 PM,  abillav...@innoventsolutions.com wrote:
 hi
 I've successfully searched over several separate collections (cores with
 unique schemas) using this kind of syntax.  This demonstrates a 2 core
 search

 http://localhost:8983/solr/collection1/select?
 q=my phrase to search on
 start=0
 rows=25
 fl=*,score
 fq={!join+fromIndex=collection2+from=sku+to=sku}id:1571


 I've split up the parameters so you see easily
 fq={!join+fromIndex=collection2+from=sku+to=sku}id:1571

 -- collection1/select  = use the select requestHandler out of collection1
 as a base
 -- collection2 is the 2nd core : equivalent of a table join in SQL
 -- sku is the field shared in both collection1, and collection2
 -- id is the field I want to find the id=1571 in.

 Hope this helps
 Anria




 On 2013-06-05 16:17, bbarani wrote:

 I am not sure the best way to search across multiple collection using SOLR
 4.3.

 Suppose, each collection have their own config files and I perform various
 operations on collections individually but when I search I want the search
 to happen across all collections. Can someone let me know how to perform
 search on multiple collections? Do I need to use sharding again?



 --
 View this message in context:

 http://lucene.472066.n3.nabble.com/Search-across-multiple-collections-tp4068469.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering on results with more than N words.

2013-06-06 Thread Jack Krupansky
I don't recall seeing any such filter. Sounds like a good idea though. 
Although, maybe it is another good idea that really isn't too necessary for 
solving many real world problems.


-- Jack Krupansky

-Original Message- 
From: Dotan Cohen

Sent: Thursday, June 06, 2013 3:45 AM
To: solr-user@lucene.apache.org
Subject: Filtering on results with more than N words.

Is there any way to restrict the search results to only those
documents with more than N words / tokens in the searched field? I
thought that this would be an easy one to Google for, but I cannot
figure it out. or find any references. There are many references to
word size in characters, but not to  filed size in words.

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 



Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Robert Krüger
I don't know what I have to do to use the atomic update feature but I
am not aware of using it. But the way you describe it, it means that
the copyField directive does not overwrite the existing field content
and that's an easy explanation to what is happening in my case. Then
the second update (which I do manually, i.e. read current state,
manipulate fields and then add the document with the same id) will
lead to this. That was not so obvious to me from the docs.

Thanks,

Robert

On Thu, Jun 6, 2013 at 12:18 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : I updated the Index using SolrJ and got the exact same error message

 there aren't a lot of specifics provided in this thread, so this may not
 be applicable, but if you mean you actaully using the atomic updates
 feature to update an existing document then the problem is that you still
 have the existing value in your name2 field, as well as another copy of
 the name field evaluated by copyField after the updates are applied...

 http://wiki.apache.org/solr/Atomic_Updates#Stored_Values


 -Hoss


Re: Group by multiple fields

2013-06-06 Thread Erick Erickson
There may be a terminology problem here. In Solr land, grouping
aka field collapsing governs how the results are returned. But
from your example, it looks like you really want summary counts
rather than return documents grouped by some field.

If you want counting, take a look at pivot faceting, which
is only available in 4.x. Here's a place to start:
http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets
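
For example (untested, but using the field names from your schema), a pivot
facet request would look something like:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=hostName,collectionName

Each hostName bucket then contains nested counts per collectionName.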

And if I've missed the boat entirely, can you clarify?

Best
Erick

On Thu, Jun 6, 2013 at 2:00 AM, Benjamin Ryan
benjamin.r...@manchester.ac.uk wrote:
 Hi,
Is it possible to create a query similar in function to 
 multiple SQL group by clauses?
I have documents that have a single valued fields for host 
 name and collection name and would like to group the results by both e.g. a 
 result would contain a count of the documents grouped by both fields:

Hostname1 collection1 456
Hostname1 collection2 567
Hostname2 collection1 123
Hostname2 collection2 789

This is on Solr 3.3 (could be on 4.x) and both fields are 
 single valued with the type:

 <fieldType name="lowerCaseSort" class="solr.TextField" 
  sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.TrimFilterFactory" />
    </analyzer>
 </fieldType>

 <field name="collectionName" type="lowerCaseSort" 
  indexed="true" stored="true" multiValued="false" required="true" 
  omitNorms="true" />
 <field name="hostName" type="lowerCaseSort" indexed="true" 
  stored="true" multiValued="false" required="true" omitNorms="true" />

 Regards,
Ben

 --
 Dr Ben Ryan
 Jorum Technical Manager

 5.12 Roscoe Building
 The University of Manchester
 Oxford Road
 Manchester
 M13 9PL
 Tel: 0160 275 6039
 E-mail: 
 benjamin.r...@manchester.ac.uk
 --



Re: SOLR CSV output in custom order

2013-06-06 Thread Erick Erickson
What happens if you include a sort clause? Warning, I've
never tried it myself...
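
Something along these lines (untested, with placeholder field names):

http://localhost:8983/solr/select?q=*:*&wt=csv&fl=fieldA,fieldB&sort=fieldA asc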

Best
Erick

On Thu, Jun 6, 2013 at 3:11 AM, anurag.jain anurag.k...@gmail.com wrote:
 I want output of csv file in proper order.  when I use wt=csv  it gives
 output in random order. Is there any way to get output in proper format.

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOLR-CSV-output-in-custom-order-tp4068527.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr: separating index and storage

2013-06-06 Thread Erick Erickson
By and large, stored fields are pretty irrelevant for resource
consumption _except_ for
disk space consumed. Sharded systems work fine; the
stored data is kept in the index files (*.fdt and *.fdx) in
each segment on each shard.

But you haven't told us anything about your data. How much are
you talking about here? 100s of GB? Terabytes? Other than disk
space, you may well be anticipating problems that don't exist...

Now, when _returning_ documents the fields must be read, so
there is some resource consumption there which you can
mitigate with lazy field loading. But this is usually just a few docs
so often isn't a problem.
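
For reference, the solrconfig.xml setting meant here (in the <query> section) is:

  <enableLazyFieldLoading>true</enableLazyFieldLoading>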

Best
Erick

On Thu, Jun 6, 2013 at 3:34 AM, Sourajit Basak sourajit.ba...@gmail.com wrote:
 Absolutely. Solr will return the reference along the docs/results; those
 references may be used to look-up the actual stuff. Such use cases aren't
 hard to solve.

 If the use case demands returning the actual stuff alongside the results,
 it becomes non-trivial, especially during high loads.

 To avoid this and do a quick implementation I can judiciously create stored
 fields and see how it performs. I will need to figure out what happens if
 the volume growth of stored fields is high, how much is the disk I/O and
 what happens if we shard the index, like, what happens to the stored fields
 then.

 Best,
 Sourajit




 On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson erickerick...@gmail.comwrote:

 You have to index something with your Solr documents that
 has meaning in _your_ system so you can find the
 original record. You don't search this field, you just
 return it with the search results and then use it to get
 the original document.

 If you're storing the original in a DB, this can be the PK.
 If on a file system the path. etc.

 Essentially, since the association is specific to your environment
 you need to handle it explicitly...

 Best
 Erick

 On Mon, Jun 3, 2013 at 11:56 AM, Sourajit Basak
 sourajit.ba...@gmail.com wrote:
  Consider the following use case.
 
  Certain words are extracted from a document and indexed. The exact
 sentence
  containing the word cannot be stored alongside the extracted word because
  of the volume at which the documents grow; How can the index and, lets
 call
  it doc servers be separated ?
 
  An option is to store the sentences in MongoDB or a RDBMS. But there
 seems
  to be a schema level design issue. Assuming 'word' to be a multivalued
  field, how do we associate to it a reference to the corresponding entry
 in
  the doc server.
 
  May create (word_1, ref_1) tuples. Is there any other in-built feature ?
 
  Any related project which separates index  doc servers ?
 
  Thanks,
  Sourajit



Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Jack Krupansky

1. Try a simple curl command to add the document.

2. Check to see if maybe there is a duplicate copyField directive in your 
schema. How many copyField directives do you have?


At least we know that it is exactly the same value duplicated and not some 
other value.


-- Jack Krupansky

-Original Message- 
From: Robert Krüger

Sent: Thursday, June 06, 2013 7:15 AM
To: solr-user@lucene.apache.org
Subject: Re: copyField generates multiple values encountered for non 
multiValued field


On Wed, Jun 5, 2013 at 9:12 PM, Jack Krupansky j...@basetechnology.com 
wrote:

Look in the Solr log - the error message should tell you what the multiple
values are. For example,

95484 [qtp2998209-11] ERROR org.apache.solr.core.SolrCore  –
org.apache.solr.common.SolrException: ERROR: [doc=doc-1] multiple values
encountered for non multiValued field content_s: [def, abc]

One of the values should be the value of the field that is the source of 
the

copyField. Maybe the other value will give you a clue as to where it came
from.

Check your SolrJ code - maybe you actually do try to initialize a value in
the field that is the copyField target.


I see the values in the stack trace:

org.apache.solr.common.SolrException: ERROR:
[doc=8f60d040-3462-4b28-998f-fd05a64f1cd8:/] multiple values
encountered for non multiValued field name2: [rename, rename]

It is just twice the value of source-field and I am not referencing
that field in my java code. 



Re: Heap space problem with mlt query

2013-06-06 Thread Erick Erickson
Your cache sizes are still much too large. I wouldn't expect
the changes you outlined to change anything. And your
autowarm sizes are still far too big. The default sizes are
512 and 0 for size and autowarm counts. Try those. In fact,
Solr will happily function (admittedly with slower queries) if
the cache sizes are 0 so a quick experiment would be to set
them to, say, 128 and 16 and see if the problem
goes away.
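
For example (only a sketch to experiment with, not tuned values), something like:

  <filterCache class="solr.FastLRUCache" size="128" initialSize="128" autowarmCount="16"/>
  <queryResultCache class="solr.FastLRUCache" size="128" initialSize="128" autowarmCount="16"/>
  <documentCache class="solr.FastLRUCache" size="128" initialSize="128" autowarmCount="0"/>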

Here's a test. Look at the admin cache pages, you can find the cache
statistics, including the number of entries in each. My bet is that
you'll see one of them get to size X (say 5,000) and soon after
hit your OOM.

I've seen this exact scenario play out a little differently. The
caches were very large but during the day indexing happened often
enough to invalidate them so they never grew. Then at night the
indexing would stop but queries kept happening and the caches grew
and BOOM, OOM errors.

Best
Erick





On Thu, Jun 6, 2013 at 6:55 AM, Varsha Rani varsha.ya...@orkash.com wrote:
 Hi Stavros,

 I checked it with batchSize=-1, But still the same issue.


 As my single mlt query is :



 http://machine_ip:8983/solr/News/mlt?q=field1:34358471&qt=/mlt&mlt.match.include=true&mlt=true&mlt.mindf=1&mlt.mintf=1&mlt.minwl=3&mlt.boost=true&fq=cat:News AND date:[136644000 TO 1362827444000] AND -category:General&mlt.qf=field2^2&mlt.fl=field3&mlt.count=30&start=0&fl=field1,field2,field3,field4,field5,field6,field7,field8,field9,field10&sort=score desc&rows=50&wt=xml&version=2.2

 Exception i faced is :


 org.apache.solr.client.solrj.SolrServerException: Error executing query
 at
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:96)
 at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:113)
 at org.orkash.automatic.mltdetector.main(mltdetector.java:212)
 Caused by: org.apache.solr.common.SolrException: Error in
 xpath:/config/luceneMatchVersion for solrconfig.xml
 org.apache.solr.common.SolrException: Error in
 xpath:/config/luceneMatchVersion for solrconfig.xml at
 org.apache.solr.core.Config.getNode(Config.java:197)at
 org.apache.solr.core.Config.getVal(Config.java:202) at
 org.apache.solr.core.Config.getLuceneVersion(Config.java:271)   at
 org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:70)  at
 org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:66)  at
 org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:68)   
   at
 org.apache.solr.search.QParser.getQuery(QParser.java:143)   at
 org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:95)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
 at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)  
   at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326) at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)   
   at
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
 at org.mortbay.jetty.HttpParser.parseNext(HttpPa

 Error in xpath:/config/luceneMatchVersion for solrconfig.xml
 org.apache.solr.common.SolrException: Error in
 xpath:/config/luceneMatchVersion for solrconfig.xml at
 org.apache.solr.core.Config.getNode(Config.java:197)at
 org.apache.solr.core.Config.getVal(Config.java:202) at
 org.apache.solr.core.Config.getLuceneVersion(Config.java:271)   at
 org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:70)  at
 org.apache.solr.search.SolrQueryParser.init(SolrQueryParser.java:66)  at
 org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:68)   
   at
 org.apache.solr.search.QParser.getQuery(QParser.java:143)   at
 org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:95)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340)
 at
 

Auto-Suggest, spell check dictionary replication to slave issue

2013-06-06 Thread msreddy.hi
Hi All,

We create 2 dictionaries from an indexed field for the auto-suggest and spell check
features. When we configured replication from master to slaves, the index is
replicating properly but not the auto-suggest / spell check dictionaries.

Is there a way to replicate the auto-suggest and spell check dictionaries outside the
index directory?

Please suggest.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-Suggest-spell-check-dictionary-replication-to-slave-issue-tp4068562.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Jack Krupansky
read current state, manipulate fields and then add the document with the 
same id)


Ahh... then you have an IMPLICIT reference to the field in your Java code - 
you explicitly told Solr that you wanted to start with all existing field 
values. Just because a field is the target of a copyField doesn't make it 
any different from any other field when reading. Although, it does beg the 
question of whether or not this field should be stored or not - that's a 
data modeling question that only you can resolve. Do queries need to 
retrieve this field?


Be sure to null out any values for any fields that are sourced by copy 
fields. Otherwise, yes, duplicated values would be exactly what you should 
expect.


Is there any reason that you can't simply use atomic update - create a new 
document with the same document id but with only "set" values for the fields 
to be changed? There is also "add" for multivalued fields.


There isn't great doc for this. Basically, the value for every non-ID field 
would be a Map object (HashMap) with a "set" key whose value is the new 
field value.


Here's a code fragment for setting one field:

   SolrInputDocument doc2 = new SolrInputDocument();
   Map<String, String> fpValue2 = new HashMap<String, String>();
   fpValue2.put("set", "fp2");
   doc2.setField("FACTURES_PRODUIT", fpValue2);

You need a separate Map object for each field to be set or added for 
appending to a multivalued field. And you need a simple (non-Map) value for 
your ID field.


-- Jack Krupansky

-Original Message- 
From: Robert Krüger

Sent: Thursday, June 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Re: copyField generates multiple values encountered for non 
multiValued field


I don't know what I have to do to use the atomic update feature but I
am not aware of using it. But the way you describe it, it means that
the copyField directive does not overwrite the existing field content
and that's an easy explanation to what is happening in my case. Then
the second update (which I do manually, i.e. read current state,
manipulate fields and then add the document with the same id) will
lead to this. That was not so obvious to me from the docs.

Thanks,

Robert

On Thu, Jun 6, 2013 at 12:18 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


: I updated the Index using SolrJ and got the exact same error message

there aren't a lot of specifics provided in this thread, so this may not
be applicable, but if you mean you actaully using the atomic updates
feature to update an existing document then the problem is that you still
have the existing value in your name2 field, as well as another copy of
the name field evaluated by copyField after the updates are applied...

http://wiki.apache.org/solr/Atomic_Updates#Stored_Values


-Hoss 




Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Prathik Puthran
Basically I want the Suggester to return "Jason Bourne" as a suggestion
for the ".*Bour.*" regex.

Thanks,
Prathik


On Thu, Jun 6, 2013 at 12:52 PM, Prathik Puthran 
prathik.puthra...@gmail.com wrote:

 This works even now i.e. when I search for Jas it suggests Jason
 Bourne. What I want is when I search for Bour or ason (any substring)
 it should suggest me Jason Bourne .


 On Thu, Jun 6, 2013 at 12:34 PM, Upayavira u...@odoko.co.uk wrote:

 Can you se the ShingleFilterFactory? It is ngrams for terms rather than
 characters. If you limited it to two term ngrams, when the user presses
 space after their first word, you could do a suggested query against
 your two term ngram field, which would suggest Jason Bourne, Jason
 Statham, etc then you press space after Jason.

 Upayavira

 On Thu, Jun 6, 2013, at 07:25 AM, Prathik Puthran wrote:
  My use case is I want to search for any substring of the indexed string
  and
  the Suggester should suggest the indexed string. What can I do to make
  this
  work?
 
  Thanks,
  Prathik
 
 
  On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev
  mkhlud...@griddynamics.com
   wrote:
 
   Please excuse my misunderstanding, but I always wonder why this index
 time
   processing is suggested usually. from my POV is the case for
 query-time
   processing i.e. PrefixQuery aka wildcard query Jason* .
   Ultra-fast term retrieval also provided by TermsComponent.
  
  
   On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky 
 j...@basetechnology.com
   wrote:
  
ngrams?
   
See:
http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
apache/lucene/analysis/ngram/**NGramFilterFactory.html
  
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
   
   
-- Jack Krupansky
   
-Original Message- From: Prathik Puthran
Sent: Wednesday, June 05, 2013 11:59 AM
To: solr-user@lucene.apache.org
Subject: Configuring lucene to suggest the indexed string for all
 the
searches of the substring of the indexed string
   
   
Hi,
   
Is it possible to configure solr to suggest the indexed string for
 all
   the
searches of the substring of the string?
   
Thanks,
Prathik
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
mkhlud...@griddynamics.com
  





AW: Heap space problem with mlt query

2013-06-06 Thread André Widhani
I am just reading through this thread by chance, but doesn't this exception:

 Caused by: org.apache.solr.common.SolrException: Error in
 xpath:/config/luceneMatchVersion for solrconfig.xml 
 org.apache.solr.common.SolrException: Error in
 xpath:/config/luceneMatchVersion for solrconfig.xml  at

indicate some missing or wrong information in solrconfig.xml, specifically the 
luceneMatchVersion field?

Sorry for the confusion if I am on a wrong track here.

André



Re: Solr: separating index and storage

2013-06-06 Thread Sourajit Basak
Each day the index grows by ~250 MB; however, I am anticipating that this
growth will slow down because there will be repetitions (just a guess). It's
not the order of growth but the limitation of our infrastructure. Basically a
budgetary constraint :-)

Apparently there seems to be no problem other than disk space. So we will go
ahead with the idea of stored fields.




On Thu, Jun 6, 2013 at 5:03 PM, Erick Erickson erickerick...@gmail.comwrote:

 By and large, stored fields are pretty irrelevant for resource
 consumption _except_ for
 disk space consumed. Sharded systems work fine, the
 stored data is stored in the index files (*.fdt and *.fdx) files in
 each segment on each shard.

 But you haven't told us anything about your data. How much are
 you talking about here? 100s of G? Terabytes? Other than disk
 space, You may well be anticipating problems that don't exist...

 Now, when _returning_ documents the fields must be read, so
 there is some resource consumption there which you can
 mitigate with lazy field loading. But this is usually just a few docs
 so often isn't a problem.

 Best
 Erick

 On Thu, Jun 6, 2013 at 3:34 AM, Sourajit Basak sourajit.ba...@gmail.com
 wrote:
  Absolutely. Solr will return the reference along the docs/results; those
  references may be used to look-up the actual stuff. Such use cases aren't
  hard to solve.
 
  If the use case demands returning the actual stuff alongside the results,
  it becomes non-trivial, especially during high loads.
 
  To avoid this and do a quick implementation I can judiciously create
 stored
  fields and see how it performs. I will need to figure out what happens if
  the volume growth of stored fields is high, how much is the disk I/O and
  what happens if we shard the index, like, what happens to the stored
 fields
  then.
 
  Best,
  Sourajit
 
 
 
 
  On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  You have to index something with your Solr documents that
  has meaning in _your_ system so you can find the
  original record. You don't search this field, you just
  return it with the search results and then use it to get
  the original document.
 
  If you're storing the original in a DB, this can be the PK.
  If on a file system the path. etc.
 
  Essentially, since the association is specific to your environment
  you need to handle it explicitly...
 
  Best
  Erick
 
  On Mon, Jun 3, 2013 at 11:56 AM, Sourajit Basak
  sourajit.ba...@gmail.com wrote:
   Consider the following use case.
  
   Certain words are extracted from a document and indexed. The exact
  sentence
   containing the word cannot be stored alongside the extracted word
 because
   of the volume at which the documents grow; How can the index and, lets
  call
   it doc servers be separated ?
  
   An option is to store the sentences in MongoDB or a RDBMS. But there
  seems
   to be a schema level design issue. Assuming 'word' to be a multivalued
   field, how do we associate to it a reference to the corresponding
 entry
  in
   the doc server.
  
   May create (word_1, ref_1) tuples. Is there any other in-built
 feature ?
  
   Any related project which separates index  doc servers ?
  
   Thanks,
   Sourajit
 



Re: Schema Change: Int - String

2013-06-06 Thread Jack Krupansky
1. Generally, any schema change requires a full reindex. Sure, a lot of 
times you can squeak by, but with Solr and Lucene there are no guarantees. 
If it works for you, great. If not, don't complain - just reindex. And even 
if it does work for the current release, there is no guarantee that a 
similar change in a future release might not require a reindex.


2. Make up your mind whether a field is a number or a string, and stick with 
that import format.


General rule: clean up your data before you send it to Solr. But... you can 
do some amount of cleanup using update processors, including white space 
trimming and limited regex editing. You can also develop custom update 
processors, as well as write in scripting languages such as JavaScript. For 
example, you could parse a string of numbers and then send them to other 
fields.
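
For example (untested sketch -- the chain name, field name, and regex are just
placeholders), an update chain that trims whitespace and strips stray characters
from a field might look like:

  <updateRequestProcessorChain name="cleanup">
    <processor class="solr.TrimFieldUpdateProcessorFactory">
      <str name="fieldName">user_ids</str>
    </processor>
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">user_ids</str>
      <str name="pattern">[^0-9 ]</str>
      <str name="replacement"></str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

You would reference such a chain from your update handler or via the
update.chain request parameter.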


3. Too hard to say from the way you have described it. Show us some sample 
input.


In general, TextField is for text, not numbers. If you intend to query data 
as numbers, don't use TextField.


-- Jack Krupansky

-Original Message- 
From: TwoFirst TwoLast

Sent: Thursday, June 06, 2013 1:25 AM
To: solr-user@lucene.apache.org
Subject: Schema Change: Int - String

1) If I change one field's type in my schema, will that cause problems with
the index or searching?  My data is pulled in chunks off of a mysql server
so one field in the currently indexed data is simply an int type field in
solr.  I would like to change this to a string moving forward, but still
expect to search across the int/string field.  Will this be ok?

2) My motivation for #1 is that I have thousands of records that are
exactly the same in mysql aside from a user_id column.  Prior to inserting
into mysql I am thinking that I can concatenate the user_ids together into
a space separated string and let solr just parse the string.  So the
database and my data import handler would change a bit.

3) If #2 is an appropriate approach, will a solr.TextField with
a solr.WhitespaceTokenizerFactory be an ok way to approach this?  This does
produce words where I would expect integers. I tried using a
solr.TrieIntField with the solr.WhitespaceTokenizerFactory, but it throws
an error.

Finally I need to make sure that exact matches will be performed on
user_ids in the string when searching.

Much appreciated! 



Re: Filtering on results with more than N words.

2013-06-06 Thread Walter Underwood
Someone else asked about this recently. The best approach is to count the words 
at index time and add a field with the count, so "title" and "title_len" or 
something like that.

wunder

On Jun 6, 2013, at 4:20 AM, Jack Krupansky wrote:

 I don't recall seeing any such filter. Sounds like a good idea though. 
 Although, maybe it is another good idea that really isn't too necessary for 
 solving many real world problems.
 
 -- Jack Krupansky
 
 -Original Message- From: Dotan Cohen
 Sent: Thursday, June 06, 2013 3:45 AM
 To: solr-user@lucene.apache.org
 Subject: Filtering on results with more than N words.
 
 Is there any way to restrict the search results to only those
 documents with more than N words / tokens in the searched field? I
 thought that this would be an easy one to Google for, but I cannot
 figure it out. or find any references. There are many references to
 word size in characters, but not to  filed size in words.
 
 Thank you.
 
 --
 Dotan Cohen
 
 http://gibberish.co.il
 http://what-is-what.com 





Download CSV, Strange thing is happening !!

2013-06-06 Thread anurag.jain
I have two fields in Solr, named 10th_mark and 12th_mark. Now I want to
download those fields in CSV, so I tried:

http://localhost:8983/solr?q=*:*&wt=csv&start=0&rows=10&fl=10th_mark,12th_mark

But the output is something like this:

th_mark













But if I put *th_mark it gives me the correct output. And if I put * then the
output comes in random order. Please give me a way to solve this type of
problem.

Please Reply ASAP,

Thanks 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Download-CSV-Strange-thing-is-happening-tp4068599.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering on results with more than N words.

2013-06-06 Thread Jack Krupansky
Yeah, but part of the problem is that an input string is not converted to 
words until analysis, which doesn't happen until after Solr creates the 
Lucene Document and hands it off to Lucene. In other words (Ha!Ha!), there 
are no words during the Solr-side of indexing. That said, you can always 
fake it by writing a JavaScript StatelessScriptUpdateProcessorFactory script 
that simulates basic tokenization, like converting punctuation to white 
space,  trimming and eliminating excess white space and then doing a split 
and count the results. Or, we could add a new update processor that did 
exactly that - CountWordsUpdateProcessorFactory. Much like 
FieldLengthUpdateProcessorFactory... maybe it could be an option on FLUPF - 
count=words/chars.
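
For example (a hypothetical sketch -- the chain name, script file, and target
field are made up), wiring such a script in would look roughly like:

  <updateRequestProcessorChain name="count-words">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">count-words.js</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

with the script doing the splitting/counting and writing the result into
something like a title_len field, as Walter suggests.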


-- Jack Krupansky

-Original Message- 
From: Walter Underwood

Sent: Thursday, June 06, 2013 9:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Filtering on results with more than N words.

Someone else asked about this recently. The best approach is to count the 
words at index time and add a field with the count, so title and 
title_len or something like that.


wunder

On Jun 6, 2013, at 4:20 AM, Jack Krupansky wrote:

I don't recall seeing any such filter. Sounds like a good idea though. 
Although, maybe it is another good idea that really isn't too necessary 
for solving many real world problems.


-- Jack Krupansky

-Original Message- From: Dotan Cohen
Sent: Thursday, June 06, 2013 3:45 AM
To: solr-user@lucene.apache.org
Subject: Filtering on results with more than N words.

Is there any way to restrict the search results to only those
documents with more than N words / tokens in the searched field? I
thought that this would be an easy one to Google for, but I cannot
figure it out. or find any references. There are many references to
word size in characters, but not to  filed size in words.

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com






Re: Download CSV, Strange thing is happening !!

2013-06-06 Thread Raymond Wiker
I think you'd be better off using field names that look like Java
identifiers - e.g., mark10 instead of 10th_mark.

Actually, let me rephrase that: you SHOULD be using field names that look
like Java identifiers - less headache, all round.


On Thu, Jun 6, 2013 at 4:01 PM, anurag.jain anurag.k...@gmail.com wrote:

 I have two field in solr, Named as 10th_mark, 12th_mark. Now I want to
 download that field in csv so i tried,


  http://localhost:8983/solr?q=*:*&wt=csv&start=0&rows=10&fl=10th_mark,12th_mark

 But output is something like that,

 th_mark
 
 
 
 
 
 
 
 
 
 



 But If i put *th_mark it is giving me correct output. But If I Put * then
 output comes in Random order, Please give me a way to solve this type of
 problem.

 Please Reply ASAP,

 Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Download-CSV-Strange-thing-is-happening-tp4068599.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.2.1 higher memory footprint vs Solr 3.5

2013-06-06 Thread Shawn Heisey
On 6/6/2013 3:50 AM, Bernd Fehling wrote:
 What helped me a lot was switching to G1GC.
 Faster, smoother, very little ripple, nearly no sawtooth.

When I tried G1, it did indeed produce a better looking memory graph,
but it didn't do anything about my GC pauses.  They were several seconds
with just CMS and NewRatio, and they actually seemed to get slightly
worse when I tried G1 instead.

To solve the GC pause problem, I've had to switch back to CMS and tack
on several more tuning options, most of which are CMS-specific.  I'm not
sure how to tune G1.  Have you done any additional tuning?
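(For illustration only, and not the actual settings referred to above: the
CMS-specific options meant here are typically flags such as

  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly

with values that depend entirely on the workload.)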

Thanks,
Shawn



Lucene Filter That Will Remove Some Tokens By Regex Pattern?

2013-06-06 Thread Furkan KAMACI
I want to use a core Lucene filter that will remove some tokens defined by
a regex pattern. What is the appropriate class for it?


Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Robert Krüger
On Thu, Jun 6, 2013 at 1:52 PM, Jack Krupansky j...@basetechnology.com wrote:
 read current state, manipulate fields and then add the document with the
 same id)

 Ahh... then you have an IMPLICIT reference to the field in your Java code -
 you explicitly told Solr that you wanted to start with all existing field
 values. Just because a field is the target of a copyField doesn't make it
 any different from any other field when reading. Although, it does beg the
 question of whether or not this field should be stored or not - that's a
 data modeling question that only you can resolve. Do queries need to
 retrieve this field?
You're right. In my concrete use case it does not need to be stored.



 Be sure to null out any values for any fields that are sourced by copy
 fields. Otherwise, yes, duplicated values would be exactly what you should
 expect.
yes, I will do that.


 Is there any reason that you can't simply use atomic update - create a new
 document with the same document id but with only set values for the fields
 to be changed? There is also add for multivalued fields.

 There isn't great doc for this. Basically, the value for every non-ID field
 would be a Map object (HashMap) with a set key whose value is the new
 field value.

 Here's a code fragment for setting one field:

    SolrInputDocument doc2 = new SolrInputDocument();
    Map<String, String> fpValue2 = new HashMap<String, String>();
    fpValue2.put("set", fp2);
    doc2.setField(FACTURES_PRODUIT, fpValue2);

 You need a separate Map object for each field to be set or added for
 appending to a multivalued field. And you need a simple (non-Map) value for
 your ID field.

Thanks for the info! The code is a lot older than Solr 4.0, so that
option was not available at the time of its writing. I will check if
it makes sense to use that feature. Most likely yes.

Robert


Re: Lucene Filter That Will Remove Some Tokens By Regex Pattern?

2013-06-06 Thread Walter Underwood
On Jun 6, 2013, at 7:24 AM, Furkan KAMACI wrote:

 I want to use a core Lucene filter that will remove some tokens defined by
 a regex pattern. What is the appropriate class for it?

Use a pattern replace filter. That will give you zero-length tokens, which can 
cause odd matches. Follow it with a length filter to remove those.

filter class=solr.PatternReplaceFilterFactory
pattern=.* replacement= replace=all/
filter class=solr.LengthFilterFactory min=1 max=1024/

wunder
--
Walter Underwood
wun...@wunderwood.org





Re: Download CSV, Strange thing is happening !!

2013-06-06 Thread Jack Krupansky

Yeah, Java-like identifiers are best.

You should be able to wrap non-Java names in a field function:

fl=field(10th_mark),field(12th_mark)

-- Jack Krupansky

-Original Message- 
From: Raymond Wiker

Sent: Thursday, June 06, 2013 10:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Download CSV, Strange thing is happening !!

I think you'd be better off using field names that look like Java
identifiers - e.g, mark10 instead of 10th_mark.

Actually, let me rephrase that: you SHOULD be using field names that look
like Java identifiers - less headache, all round.


On Thu, Jun 6, 2013 at 4:01 PM, anurag.jain anurag.k...@gmail.com wrote:


I have two field in solr, Named as 10th_mark, 12th_mark. Now I want to
download that field in csv so i tried,


http://localhost:8983/solr?q=*:*wt=csvstart=0rows=10fl=10th_mark,12th_mark

But output is something like that,

th_mark













But If i put *th_mark it is giving me correct output. But If I Put * then
output comes in Random order, Please give me a way to solve this type of
problem.

Please Reply ASAP,

Thanks



--
View this message in context:
http://lucene.472066.n3.nabble.com/Download-CSV-Strange-thing-is-happening-tp4068599.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Mikhail Khludnev
Got it. It's actually the opposite of the usual prefix suggestions.
So, out of the box it's provided by
http://wiki.apache.org/solr/TermsComponent with terms.regex= (also see the last
example there).
It works by loading terms into memory and linearly scanning them with
the regexp.
There is nothing more efficient out of the box.
http://wiki.apache.org/solr/Suggester says Support for infix-suggestions
_is planned_ for FSTLookup (which would be the only structure to support
these).
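For example, a TermsComponent request of this shape (the field name is
illustrative) returns indexed terms containing bour:

  http://localhost:8983/solr/terms?terms.fl=name&terms.regex=.*bour.*&terms.regex.flag=case_insensitive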


On Thu, Jun 6, 2013 at 10:25 AM, Prathik Puthran 
prathik.puthra...@gmail.com wrote:

 My use case is I want to search for any substring of the indexed string and
 the Suggester should suggest the indexed string. What can I do to make this
 work?

 Thanks,
 Prathik


 On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:

  Please excuse my misunderstanding, but I always wonder why this index
 time
  processing is suggested usually. from my POV is the case for query-time
  processing i.e. PrefixQuery aka wildcard query Jason* .
  Ultra-fast term retrieval also provided by TermsComponent.
 
 
  On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   ngrams?
  
   See:
   http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
   apache/lucene/analysis/ngram/**NGramFilterFactory.html
 
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
  
  
   -- Jack Krupansky
  
   -Original Message- From: Prathik Puthran
   Sent: Wednesday, June 05, 2013 11:59 AM
   To: solr-user@lucene.apache.org
   Subject: Configuring lucene to suggest the indexed string for all the
   searches of the substring of the indexed string
  
  
   Hi,
  
   Is it possible to configure solr to suggest the indexed string for all
  the
   searches of the substring of the string?
  
   Thanks,
   Prathik
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Walter Underwood
Let's clear up some things about how Solr works.

1. Solr matches individual words, not the whole text. So Jason Bourne is 
split into [Jason, Bourne]. The leading .* in your pattern does not match 
preceding words, it would match the beginning of a single word.

2. Query time wildcards test every word in the index. This might be a billion 
words. Of course that is slow. This is why we try to do things at index time. 
With ngrams, there is one lookup, not a billion wildcard matches.

3. Regexes will almost always be the slowest way to do something in Solr, and 
are almost always too slow for production.

Now, what are you trying to do for the user? It seems like you have decided on 
a solution and are asking about that.

Solr already has many built-in solutions, so if we know the root problem, we 
may find an easy solution.
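For reference, an index-time ngram field can be declared roughly like this
(a sketch; the type name and gram sizes are illustrative):

  <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

A query term like bour then matches the stored grams directly, with no
wildcard scan over the whole term dictionary.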

wunder

On Jun 6, 2013, at 4:53 AM, Prathik Puthran wrote:

 Basically I want the Suggester to return for Jason Bourne as suggestion
 for .*Bour.* regex.
 
 Thanks,
 Prathik
 
 
 On Thu, Jun 6, 2013 at 12:52 PM, Prathik Puthran 
 prathik.puthra...@gmail.com wrote:
 
 This works even now i.e. when I search for Jas it suggests Jason
 Bourne. What I want is when I search for Bour or ason (any substring)
 it should suggest me Jason Bourne .
 
 
 On Thu, Jun 6, 2013 at 12:34 PM, Upayavira u...@odoko.co.uk wrote:
 
 Can you se the ShingleFilterFactory? It is ngrams for terms rather than
 characters. If you limited it to two term ngrams, when the user presses
 space after their first word, you could do a suggested query against
 your two term ngram field, which would suggest Jason Bourne, Jason
 Statham, etc then you press space after Jason.
 
 Upayavira
 
 On Thu, Jun 6, 2013, at 07:25 AM, Prathik Puthran wrote:
 My use case is I want to search for any substring of the indexed string
 and
 the Suggester should suggest the indexed string. What can I do to make
 this
 work?
 
 Thanks,
 Prathik
 
 
 On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev
 mkhlud...@griddynamics.com
 wrote:
 
 Please excuse my misunderstanding, but I always wonder why this index
 time
 processing is suggested usually. from my POV is the case for
 query-time
 processing i.e. PrefixQuery aka wildcard query Jason* .
 Ultra-fast term retrieval also provided by TermsComponent.
 
 
 On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky 
 j...@basetechnology.com
 wrote:
 
 ngrams?
 
 See:
 http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
 apache/lucene/analysis/ngram/**NGramFilterFactory.html
 
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html
 
 
 -- Jack Krupansky
 
 -Original Message- From: Prathik Puthran
 Sent: Wednesday, June 05, 2013 11:59 AM
 To: solr-user@lucene.apache.org
 Subject: Configuring lucene to suggest the indexed string for all
 the
 searches of the substring of the indexed string
 
 
 Hi,
 
 Is it possible to configure solr to suggest the indexed string for
 all
 the
 searches of the substring of the string?
 
 Thanks,
 Prathik
 
 
 
 
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com
 
 






Re: Download CSV, Strange thing is happening !!

2013-06-06 Thread anurag.jain
fl=field(10th_mark),field(12th_mark) 

If I use wt=csv it gives me no output, but with wt=json it gives me
output.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Download-CSV-Strange-thing-is-happening-tp4068599p4068633.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: data-import problem

2013-06-06 Thread bbarani
The below error clearly says that you have declared a unique id but that
unique id is missing for some documents.

org.apache.solr.common.SolrException: [doc=null] missing required field:
nameid

This is mainly because you are just trying to import 2 tables into a
document without any relationship between the data of the 2 tables.

table 1 has the nameid (unique key) but table 2 has to be joined with table
1 to form a relationship between the 2 tables. You can't just dump the value
since table 2 might have more values than table1 (but table1 has the unique
id).

I am not sure of your table structure, I am assuming that there is a key
(ex: nameid in title table) that can be used to join name and title table.

Try something like this..

  document
entity name=name query=SELECT id, name FROM name LIMIT 10
field column=id name=nameid /
field column=name name=name /
/entity
*entity name=title query=SELECT id, title FROM title where
nameid=${name.id}
*field column=id name=titleid /
field column=title name=title /
/entity
  /document
/dataConfig



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Group by multiple fields

2013-06-06 Thread bbarani
Not sure if this solution will work for you but this is what I did to
implement nested grouping using SOLR 3.X.

The simple idea behind it is to concatenate the 2 fields, index them into a
single field, and group on that field.

http://stackoverflow.com/questions/12202023/field-collapsing-grouping-how-to-make-solr-return-intersection-of-2-resultse
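Client-side (for example with SolrJ) the combined field can be built when the
document is constructed; this is only a sketch and the field names are made up:

  String valueA = "abc";
  String valueB = "xyz";
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("fieldA", valueA);
  doc.addField("fieldB", valueB);
  // combined field that exists only so results can be grouped on the pair
  doc.addField("fieldA_fieldB", valueA + "|" + valueB);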



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Group-by-multiple-fields-tp4068518p4068638.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering on results with more than N words.

2013-06-06 Thread Walter Underwood
I was thinking of counting the words before the field is indexed. It is quite 
possible that splitting on white space would be sufficient.

Of course, some idea of what problem this is supposed to solve would be very 
helpful.
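As a rough sketch of that client-side counting idea with SolrJ (field names
are made up):

  import org.apache.solr.common.SolrInputDocument;

  String title = "The Quick Brown Fox";
  // naive word count: split on runs of whitespace
  int titleLen = title.trim().isEmpty() ? 0 : title.trim().split("\\s+").length;

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc-1");
  doc.addField("title", title);
  doc.addField("title_len", titleLen);

A query can then filter with something like fq=title_len:[5 TO *].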

wunder

On Jun 6, 2013, at 7:07 AM, Jack Krupansky wrote:

 Yeah, but part of the problem is that an input string is not converted to 
 words until analysis, which doesn't happen until after Solr creates the 
 Lucene Document and hands it off to Lucene. In other words (Ha!Ha!), there 
 are no words during the Solr-side of indexing. That said, you can always fake 
 it by writing a JavaScript StatelessScriptUpdateProcessorFactory script that 
 simulates basic tokenization, like converting punctuation to white space,  
 trimming and eliminating excess white space and then doing a split and count 
 the results. Or, we could add a new update processor that did exactly that - 
 CountWordsUpdateProcessorFactory. Much like 
 FieldLengthUpdateProcessorFactory... maybe it could be an option on FLUPF - 
 count=words/chars.
 
 -- Jack Krupansky
 
 -Original Message- From: Walter Underwood
 Sent: Thursday, June 06, 2013 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Filtering on results with more than N words.
 
 Someone else asked about this recently. The best approach is to count the 
 words at index time and add a field with the count, so title and 
 title_len or something like that.
 
 wunder
 
 On Jun 6, 2013, at 4:20 AM, Jack Krupansky wrote:
 
 I don't recall seeing any such filter. Sounds like a good idea though. 
 Although, maybe it is another good idea that really isn't too necessary for 
 solving many real world problems.
 
 -- Jack Krupansky
 
 -Original Message- From: Dotan Cohen
 Sent: Thursday, June 06, 2013 3:45 AM
 To: solr-user@lucene.apache.org
 Subject: Filtering on results with more than N words.
 
 Is there any way to restrict the search results to only those
 documents with more than N words / tokens in the searched field? I
 thought that this would be an easy one to Google for, but I cannot
 figure it out. or find any references. There are many references to
 word size in characters, but not to  filed size in words.
 
 Thank you.
 
 --
 Dotan Cohen
 
 http://gibberish.co.il
 http://what-is-what.com
 
 
 

--
Walter Underwood
wun...@wunderwood.org





Re: data-import problem

2013-06-06 Thread Stavros Delisavas
It's surprising to me that all tables have to have a relationship in 
order to be used in Solr. What if I have two independent projects 
running on the same webserver? Would I really not be able to use Solr for both 
of them? That would be very disappointing...


Anyway, luckily there is an indirect relationship between the two tables, 
but it is an N-to-N relationship with a third table in between. The 
full join in MySQL would be something like this:


SELECT (cast.id??), title.id, title.title, name.id, name.name
FROM name, title, cast
WHERE title.id = cast.movie_id
AND cast.person_id = name.id

But this will definitely lead to multiple entries of name.name and 
title.title because they are connected by an N-to-N relationship. So 
the resulting table would not have unique keys either!! Neither title.id nor 
name.id. There is another id available, cast.id, which could be used as a 
unique id, but it is a completely useless and irrelevant id which has no 
connection/relation to anything else at all. So there is no real use in 
including it, unless Solr really needs a unique id.


I am still a noob with Solr. Can you please help me adapt the given 
join to the XML syntax for my data-config.xml?

That would be very great!


Am 06.06.2013 17:58, schrieb bbarani:

The below error clearly says that you have declared a unique id but that
unique id is missing for some documents.

org.apache.solr.common.SolrException: [doc=null] missing required field:
nameid

This is mainly because you are just trying to import 2 tables in to a
document without any relationship between the data of 2 tables.

table 1 has the nameid (unique key) but table 2 has to be joined with table
1 to form a relationship between the 2 tables. You can't just dump the value
since table 2 might have more values than table1 (but table1 has the unique
id).

I am not sure of your table structure, I am assuming that there is a key
(ex: nameid in title table) that can be used to join name and title table.

Try something like this..

   document
 entity name=name query=SELECT id, name FROM name LIMIT 10
 field column=id name=nameid /
 field column=name name=name /
 /entity
*entity name=title query=SELECT id, title FROM title where
nameid=${name.id}
*field column=id name=titleid /
 field column=title name=title /
 /entity
   /document
/dataConfig



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068636.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr indexing slows down

2013-06-06 Thread Michael Della Bitta
Hi Sebastian,

What database are you using? How much RAM is available on your machine? It
looks like you're selecting from a view... Have you tried paging through
the view outside of Solr? Does that slow down as well? Do you notice any
increased load on the Solr box or the database server?



Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Thu, Jun 6, 2013 at 6:13 AM, Sebastian Steinfeld 
sebastian.steinf...@mgm-tp.com wrote:

 Hi,

 I am new to solr and we want to use Solr to speed up our product search.
 And it is working really nice, but I think I have a problem with the
 indexing.
 It slows down after a few minutes.

 I am using the DataImportHandler to import the products from the database.
 And I start the import by executing the following HTTP request:
 /dataimport?command=full-importclean=truecommit=true

 I guess these are the important parts of my configuration:

 schema.xml:
 --
 fields
field name=pk   type=longindexed=true
  stored=true required=true  /
field name=code type=string  indexed=true
  stored=true required=true  /
field name=ean  type=string  indexed=true
  stored=false  /
field name=name type=lowercase   indexed=true
  stored=false  /
field name=text type=text_general indexed=true stored=false
 multiValued=true/
field name=_version_ type=long indexed=true stored=true/
 /fields
 
 fieldType name=lowercase class=solr.TextField
 positionIncrementGap=100
   analyzer
 tokenizer class=solr.KeywordTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory /
   /analyzer
 /fieldType
 --

 solrconfig.xml:
 --
   requestHandler name=/dataimport
 class=org.apache.solr.handler.dataimport.DataImportHandler
 lst name=defaults
 str name=configdataimport-handler.xml/str
 /lst
   /requestHandler
 --

 dataimport-handler.xml:
 --
 dataConfig
 dataSource name=local driver=* 
 url=*
 user=* 
 password=*
 /
document
 entity name=product pk=PRODUCTS_PK dataSource=local
 query=SELECT   PRODUCTS_PK, PRODUCTS_CODE,
 PRODUCTS_EAN, PRODUCTSLP_NAME FROM V_SOLR_IMPORT4PRODUCT_SEARCH
 field column=PRODUCTS_PK   name=pk /
 field column=PRODUCTS_CODE name=code /
 field column=PRODUCTS_EAN  name=ean /
 field column=PRODUCTSLP_NAME   name=name /
 /entity
 /document
 /dataConfig
 --

 The amount of documents I want to index is 8 million; the first 1.6 million
 are indexed in 2 min, but the complete import takes nearly 2 hours.
 The size of the index on the hard drive is 610MB.
 I started the solr server with 2GB memory.


 I read that the duration of indexing might be connected to the batch size,
 so I increased the batchSize in the dataSource to 10,000, but this didn't
 make any difference.
 I also tried to disable the autocommit, which is configured in
 solrconfig.xml, by commenting it out, but this also didn't make
 any difference.

 It would be really nice if one of you could help me with this problem.

 Thank you very much,
 Sebastian




Re: data-import problem

2013-06-06 Thread Walter Underwood
When designing for Solr (or most search engines), think in terms of documents, 
not tables.

What do your search results look like? You will want one document for each 
search result. The document will have stored fields for each thing displayed 
and indexed fields for each thing searched.

If you are starting with a relational database, think about a view that will 
have one row per document. Denormalize as much as you need in order to get that.

For implementation, it might be a view, or an index time query, but the concept 
of a view may help you design.

wunder

On Jun 6, 2013, at 9:24 AM, Stavros Delisavas wrote:

 It's surprising to me that all tables have to have a relationship in order to 
 be used in solr. What if I have two indipendent projects running on the same 
 webserver? I would not be able to use Solr for both of them, really? That 
 would be very dissappointing...
 
 Anyway, luckily there is an indirect relationship between the two tables but 
 there is an N to N relationship with a thrid table in between. The full 
 join in MySQL would be something like this:
 
 SELECT (cast.id??), title.id, title.title, name.id, name.name
 FROM name, title, cast
 WHERE title.id = cast.movie_id
 AND cast.person_id = name.id
 
 But this will definatly lead to multiple entries of name.name and title.title 
 because they are connected with an N-to-N relationship. So the resulting 
 table would not have unique keys either!! Nor title.id or name.id. There is 
 another id available cast.id which could be used as a unique id, but its a 
 completly useless and irrelevant id which has no connection/relation to 
 anything else at all. So there is no real use for it to include it, unless 
 Solr really needs a unique id.
 
 I am still a noob with Solr. Can you please help me to adapt the given Join 
 to the xml-syntax for my data-config.xml?
 That would be very great!
 
 
 Am 06.06.2013 17:58, schrieb bbarani:
 The below error clearly says that you have declared a unique id but that
 unique id is missing for some documents.
 
 org.apache.solr.common.SolrException: [doc=null] missing required field:
 nameid
 
 This is mainly because you are just trying to import 2 tables in to a
 document without any relationship between the data of 2 tables.
 
 table 1 has the nameid (unique key) but table 2 has to be joined with table
 1 to form a relationship between the 2 tables. You can't just dump the value
 since table 2 might have more values than table1 (but table1 has the unique
 id).
 
 I am not sure of your table structure, I am assuming that there is a key
 (ex: nameid in title table) that can be used to join name and title table.
 
 Try something like this..
 
   document
 entity name=name query=SELECT id, name FROM name LIMIT 10
 field column=id name=nameid /
 field column=name name=name /
 /entity
 *entity name=title query=SELECT id, title FROM title where
 nameid=${name.id}
 *field column=id name=titleid /
 field column=title name=title /
 /entity
   /document
 /dataConfig
 






Re: Images in the Solr Wiki

2013-06-06 Thread Chris Hostetter
:  Request to infra filed...
: 
:  https://issues.apache.org/jira/browse/INFRA-6345

FYI: Fixed.


-Hoss


Re: data-import problem

2013-06-06 Thread bbarani
You don't really need to have a relationship, but the unique id should be
unique across documents. I mentioned the relationship because the
unique key was present in only one table but not the other..

Check out this link for more information on importing multiple table data.

http://lucene.472066.n3.nabble.com/Create-index-on-few-unrelated-table-in-Solr-td4068054.html



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068650.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr indexing slows down

2013-06-06 Thread Shawn Heisey

On 6/6/2013 4:13 AM, Sebastian Steinfeld wrote:

The amout of documents I want to index is 8 million, the first 1,6 million are 
indexed in 2min, but to complete the Import it takes nearly 2 hours.
The size of the index on the hard drive is 610MB.
I started the solr server with 2GB memory.

I read that the duration of indexing might be connected to the batch size, so I 
increased the batchSize in the dataSource to 10.000, but this didn't make any 
differences.
I also tried to disable the autocommit, which is configured in the 
solrconfig.xml. I disabled it by uncommenting it, but this also didn't made any 
differences.


If you are importing from MySQL, you actually want the batchSize to be 
-1.  This streams the results so they don't take up large blocks of 
memory.  Other JDBC drivers have different ways of configuring this mode 
of operation.  You fully redacted the driver and URL in your config 
file, so I don't know what you are using.
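For MySQL, the streaming mode is just the batchSize attribute on the
dataSource; everything else below is a placeholder:

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="..." password="..."
              batchSize="-1"/>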


2GB of Java heap for Solr is probably not enough.  It's likely that once 
your index gets big enough, Solr is starved for memory and has to 
perform constant garbage collections to free up enough for basic 
operation.  I would bet that you also don't have enough free memory for 
the OS to cache the index well:


http://wiki.apache.org/solr/SolrPerformanceProblems

If you are using 4.x with the updateLog turned on, then you want 
autoCommit enabled with openSearcher to be false.  This is covered on 
the wiki page I linked.


Thanks,
Shawn



Re: data-import problem

2013-06-06 Thread Stavros Delisavas
Unfortunately my two tables do not share a unique key; they both have 
integers as keys starting with number 1. Is there any way to overcome 
this problem? Removing the uniqueKey property from my schema.xml leads 
to Solr not working (I have tried that already).
The link you provided shows what I had already tried before, which 
led to my current problem. When I set up my data-config as shown 
in that thread, my second table does not get recorded because of the 
missing field (name.id/nameid, the unique key) in my title table...



Am 06.06.2013 18:32, schrieb bbarani:

You don't really need to have a relationship but the unique id should be
unique in a document. I had mentioned about the relationship due to the fact
that the unique key was present only in one table but not the other..

Check out this link for more information on importing multiple table data.

http://lucene.472066.n3.nabble.com/Create-index-on-few-unrelated-table-in-Solr-td4068054.html



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068650.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Images in the Solr Wiki

2013-06-06 Thread Michael Della Bitta
Thanks a lot for your help!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Thu, Jun 6, 2013 at 1:02 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 : Thanks! Now it looks like thumbs work, but if you click to see the larger
 : attachment, that 403s. But maybe that makes sense?

 Ah crap i didn't notice that ... no, that doesn't really make sense.

 i think i understand the problem, looks like maybe the fix missed a moin
 moin verb, i've re-opened.

 https://issues.apache.org/jira/browse/INFRA-6345




 -Hoss



Re: data-import problem

2013-06-06 Thread Shawn Heisey

On 6/6/2013 11:15 AM, Stavros Delisavas wrote:

Unfortunatly my two tables do not share a unique key. they both have
integers as keys starting with number 1. Is there any way to overcome
this problem? Removing the uniquekey-property from my schema.xml leads
to solr not working (I have tryed that already).
The link you provided is showing what I have already tryed before which
was leading to my current problem. When I setup my data-config as shown
in that thread, my second table does not get recorded because of the
missing field (name.id/nameid the unique key) in my title-table...


Change the id field to a StrField in your schema, and then use something 
like this:


document
entity name=name query=SELECT CONCAT('name-',id) AS id, name 
FROM name/entity
entity name=title query=SELECT CONCAT('title-',id) AS id, title 
FROM title/entity

/document

If these documents have no connection to each other at all, set up 
multiple cores so they are entirely separate indexes.


Thanks,
Shawn



Re: data-import problem

2013-06-06 Thread Stavros Delisavas

Perfect! This finally worked! Shawn, thank you a lot!

How do I set up multiple cores?

Again, thank you so much! I was looking for a solution for days!


Am 06.06.2013 19:23, schrieb Shawn Heisey:

On 6/6/2013 11:15 AM, Stavros Delisavas wrote:

Unfortunatly my two tables do not share a unique key. they both have
integers as keys starting with number 1. Is there any way to overcome
this problem? Removing the uniquekey-property from my schema.xml leads
to solr not working (I have tryed that already).
The link you provided is showing what I have already tryed before which
was leading to my current problem. When I setup my data-config as shown
in that thread, my second table does not get recorded because of the
missing field (name.id/nameid the unique key) in my title-table...


Change the id field to a StrField in your schema, and then use 
something like this:


document
entity name=name query=SELECT CONCAT('name-',id) AS id, name 
FROM name/entity
entity name=title query=SELECT CONCAT('title-',id) AS id, 
title FROM title/entity

/document

If these documents have no connection to each other at all, set up 
multiple cores so they are entirely separate indexes.


Thanks,
Shawn





Re: data-import problem

2013-06-06 Thread Shawn Heisey

On 6/6/2013 11:38 AM, Stavros Delisavas wrote:

Perfect! This finally worked! Shawn, thank you a lot!

How do I set up multiple cores?

Again, thank you so much! I was looking for a solution for days!


Cores are defined in solr.xml - the default example core is named 
collection1.  I am struggling to find documentation for multicore that 
is suitable for a novice.  There is some information on this wiki page, 
but it is geared towards the use of the CoreAdmin API, not multiple 
cores themselves.


wiki.apache.org/solr/CoreAdmin

To access a specific core with query urls, you don't use URLs like 
/solr/select that you might have seen in documentation, you use 
/solr/corename/select or /solr/corename/update instead.
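A minimal legacy-style solr.xml with two separate cores looks roughly like
this (core names and directories are only an example):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="names"  instanceDir="names"/>
      <core name="titles" instanceDir="titles"/>
    </cores>
  </solr>

Each instanceDir then holds its own conf/ directory with schema.xml and
solrconfig.xml.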


Thanks,
Shawn



Re: data-import problem

2013-06-06 Thread bbarani
Not sure if I understand your situation.. I am not sure how you would relate
the data between 2 tables if there's no relationship? You are trying to just
dump random values from 2 tables into a document?

Consider

Table1:
Name    id
peter   1
john    2
mike    3

Table2:
Title      TitleId
CEO        111
developer  222
Officer    333
Cleaner    444
IT         555

Your document will look something like this, but Peter is a cleaner and not a
CEO:

1  peter  CEO



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068677.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: data-import problem

2013-06-06 Thread Stavros Delisavas

Think about movies and the cast of a movie.
There are movies (title) which have their unique ids. And there are many 
people (name), like the producer, actors, etc., which have their unique 
ids. But there are people who have been actors in more than one movie. That's 
why I have a third table which connects those two tables via name.id and 
title.id.


Anyway, I think my problem is satisfactorily solved for me. Do you think I 
did something wrong?



Am 06.06.2013 19:45, schrieb bbarani:

Not sure if I understand your situation.. I am not sure how you would relate
the data between 2 tables if there's no relationship? You are trying to just
dump random values from 2 tables into a document?

Consider

Table1:
Name    id
peter   1
john    2
mike    3

Table2:
Title      TitleId
CEO        111
developer  222
Officer    333
Cleaner    444
IT         555

Your document will look something like this, but Peter is a cleaner and not a
CEO:

1  peter  CEO



--
View this message in context: 
http://lucene.472066.n3.nabble.com/data-import-problem-tp4068345p4068677.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: data-import problem

2013-06-06 Thread Stavros Delisavas
That's okay. For now, I guess it is okay. Finally I could import all 6.6 
million entries successfully. I am happy.




Am 06.06.2013 19:44, schrieb Shawn Heisey:

On 6/6/2013 11:38 AM, Stavros Delisavas wrote:

Perfect! This finally worked! Shawn, thank you a lot!

How do I set up multiple cores?

Again, thank you so much! I was looking for a solution for days!


Cores are defined in solr.xml - the default example core is named 
collection1.  I am struggling to find documentation for multicore that 
is suitable for a novice.  There is some information on this wiki 
page, but it is geared towards the use of the CoreAdmin API, not 
multiple cores themselves.


wiki.apache.org/solr/CoreAdmin

To access a specific core with query urls, you don't use URLs like 
/solr/select that you might have seen in documentation, you use 
/solr/corename/select or /solr/corename/update instead.


Thanks,
Shawn





OutOfMemory while indexing (PROD environment!)

2013-06-06 Thread Isaac Hebsh
Hi everyone,

My SolrCloud cluster (4.3.0) came into production a few days ago.
Docs are being indexed into Solr using /update requestHandler, as a POST
request, containing text/xml content-type.

The collection is sharded into 36 pieces, each shard has two replicas.
There are 36 nodes (each node on separate virtual machine), so each node
holds exactly 2 cores.

Each update request contains 100 docs, which means 2-3 docs for each shard.
There are 1-2 such requests every minute. Soft-commit happens every 10
minutes, Hard-commit every 30 minutes, and ramBufferSizeMB=128.

After 48 hours of zero problems, suddenly one shard went down (its both
cores). Log says it's OOM (GC overhead limit exceeded). JVM is set to
Xmx=4G.
I'm pretty sure that some minutes before this incident, JVM memory wasn't
so high (even the max memory usage indicator was below 2G).

Indexing requests did not stop, and started getting HTTP 503 errors (no
server hosting shard). At this time, some other cores started to go down
(I had all of the rainbow colors: Active, Recovering, Down, Recovery Failed
and Gone :).

Then I tried to restart Tomcat on the down nodes, but some of them failed
to start, due to the error message: we are not the leader. Only shutting
down both cores and starting them gradually solved the problem,
and the whole cluster came back to a green state.

Solr is not yet exposed to users, so no queries have been made at that time
(but maybe some non-heavy auto-warm queries were executed).

I don't think that all of the 4GB were being used for justifiable reasons..
I guess that adding more RAM will not solve the problem, in the long term.

Where should I start my log investigation? (about the OOM itself, and about
the chain accident that came after it)

I did a search for previous similar issues. There are a lot, but most of
them talks about very old versions of Solr.

[Versions:
Solr: 4.3.0
Tomcat 7
JVM: Oracle 7 (last, standard, JRE), 64bit.
OS: RedHat 6.3]


OR query with null value and non-null value(s)

2013-06-06 Thread Rahul R
I have recently enabled facet.missing=true in solrconfig.xml which gives
null facet values also. As I understand it, the syntax to do a faceted
search on a null value is something like this:
fq=-price:[* TO *]
So when I want to search on a particular value (for example : 4)  OR null
value, I would expect the syntax to be something like this:
fq=(price:4+OR+(-price:[* TO *]))
But this does not work. After searching around for more, read somewhere
that the right way to achieve this would be:
fq=-(-price:4+AND+price:[*+TO+*])
Now this does work but seems like a very roundabout way. Is there a better
way to achieve this ?

I use solrJ in Solr 3.4.

Thank you.

- Rahul


Re: OR query with null value and non-null value(s)

2013-06-06 Thread Shawn Heisey

On 6/6/2013 12:28 PM, Rahul R wrote:

I have recently enabled facet.missing=true in solrconfig.xml which gives
null facet values also. As I understand it, the syntax to do a faceted
search on a null value is something like this:
fq=-price:[* TO *]
So when I want to search on a particular value (for example : 4)  OR null
value, I would expect the syntax to be something like this:
fq=(price:4+OR+(-price:[* TO *]))
But this does not work. After searching around for more, read somewhere
that the right way to achieve this would be:
fq=-(-price:4+AND+price:[*+TO+*])
Now this does work but seems like a very roundabout way. Is there a better
way to achieve this ?


Pure negative queries don't work -- you have to have results in the 
query before you can subtract.  For some top-level queries, Solr is able 
to detect this situation and fix it internally, but on inner queries you 
must explicitly state your intentions.  It is best if you always use 
'*:* -query' syntax, just to be safe.


fq=(price:4+OR+(*:* -price:[* TO *]))
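Since the original question mentions SolrJ, the same filter can be set there
roughly like this (a sketch, given an existing SolrServer instance):

  SolrQuery query = new SolrQuery("*:*");
  // docs priced at 4 OR docs that have no price at all
  query.addFilterQuery("price:4 OR (*:* -price:[* TO *])");
  QueryResponse rsp = server.query(query);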

Thanks,
Shawn



new xslt

2013-06-06 Thread Christopher Gross
In 3.x Solr (and earlier) I was able to create a new xslt doc in the
conf/xslt directory and immediately start using it.

In my 4.1 setup, I have:
  queryResponseWriter name=xslt class=solr.XSLTResponseWriter
int name=xsltCacheLifetimeSeconds5/int
  /queryResponseWriter

But after that small wait I still can't use it.  Is there another setting
that I'm missing somewhere?  I am using SolrCloud, do I need to have
zookeeper push that change out?

Thanks!

-- Chris


LotsOfCores feature

2013-06-06 Thread Aleksey
I was looking at this wiki and linked issues:
http://wiki.apache.org/solr/LotsOfCores

they talk about a limit being 100K cores. Is that per server or per
entire fleet because zookeeper needs to manage that?

 I was considering a use case where I have tens of millions of indices
 but less than a million need to be active at any time, so they need
 to be loaded on demand and evicted when not used for a while.
 Also, since the number one requirement is efficient loading, I of course
 assume I will store a prebuilt index somewhere so Solr will just
 download it and strap it in, right?

 The root issue is marked as won't fix but some other important
 subissues are marked as resolved. What's the overall status of the
effort?

Thank you in advance,

Aleksey


Re: Solr: separating index and storage

2013-06-06 Thread Erick Erickson
bq: I am anticipating that this growth will slow down because there
will be repetitions

This will be true for your indexed data, but NOT for your stored data.
Each stored
field is stored as-is per document. It'll be compressed, so won't take
up the entire
250M, but it'll still be stored.

FWIW,
Erick

On Thu, Jun 6, 2013 at 8:02 AM, Sourajit Basak sourajit.ba...@gmail.com wrote:
 Each day the index grows by ~250 MB; however I am anticipating that this
 growth will slow down because there will be repetitions (just a guess). Its
 not the order of growth but limitation of our infrastructure. Basically a
 budgetary constraint :-)

 Apparently there seems to be no problem than disk space. So we will go
 ahead with the idea of stored fields.




 On Thu, Jun 6, 2013 at 5:03 PM, Erick Erickson erickerick...@gmail.comwrote:

 By and large, stored fields are pretty irrelevant for resource
 consumption _except_ for
 disk space consumed. Sharded systems work fine, the
 stored data is stored in the index files (*.fdt and *.fdx) files in
 each segment on each shard.

 But you haven't told us anything about your data. How much are
 you talking about here? 100s of G? Terabytes? Other than disk
 space, You may well be anticipating problems that don't exist...

 Now, when _returning_ documents the fields must be read, so
 there is some resource consumption there which you can
 mitigate with lazy field loading. But this is usually just a few docs
 so often isn't a problem.

 Best
 Erick

 On Thu, Jun 6, 2013 at 3:34 AM, Sourajit Basak sourajit.ba...@gmail.com
 wrote:
  Absolutely. Solr will return the reference along the docs/results; those
  references may be used to look-up the actual stuff. Such use cases aren't
  hard to solve.
 
  If the use case demands returning the actual stuff alongside the results,
  it becomes non-trivial, especially during high loads.
 
  To avoid this and do a quick implementation I can judiciously create
 stored
  fields and see how it performs. I will need to figure out what happens if
  the volume growth of stored fields is high, how much is the disk I/O and
  what happens if we shard the index, like, what happens to the stored
 fields
  then.
 
  Best,
  Sourajit
 
 
 
 
  On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  You have to index something with your Solr documents that
  has meaning in _your_ system so you can find the
  original record. You don't search this field, you just
  return it with the search results and then use it to get
  the original document.
 
  If you're storing the original in a DB, this can be the PK.
  If on a file system the path. etc.
 
  Essentially, since the association is specific to your environment
  you need to handle it explicitly...
 
  Best
  Erick
 
  On Mon, Jun 3, 2013 at 11:56 AM, Sourajit Basak
  sourajit.ba...@gmail.com wrote:
   Consider the following use case.
  
   Certain words are extracted from a document and indexed. The exact
  sentence
   containing the word cannot be stored alongside the extracted word
 because
   of the volume at which the documents grow; How can the index and, lets
  call
   it doc servers be separated ?
  
   An option is to store the sentences in MongoDB or a RDBMS. But there
  seems
   to be a schema level design issue. Assuming 'word' to be a multivalued
   field, how do we associate to it a reference to the corresponding
 entry
  in
   the doc server.
  
   May create (word_1, ref_1) tuples. Is there any other in-built
 feature ?
  
   Any related project which separates index  doc servers ?
  
   Thanks,
   Sourajit
 



Re: LotsOfCores feature

2013-06-06 Thread Erick Erickson
100K is really not the limit, it's just hard to imagine
100K cores on a single machine unless some were
really rarely used. And it's per node, not cluster-wide.

The current state is that everything is in place, including
transient cores, auto-discovery, etc. So you should be
able to go ahead and try it out.
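For example, a core intended for on-demand loading can be declared in the
legacy solr.xml roughly like this (names and the cache size are illustrative):

  <cores adminPath="/admin/cores" transientCacheSize="1000">
    <core name="core12345" instanceDir="core12345"
          loadOnStartup="false" transient="true"/>
  </cores>

Cores marked transient are loaded on first use and evicted once the transient
cache is full.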

The next bit that will help with efficiency is sharing named
config sets. The intent here is that solrhome/configs will
contain sub-dirs like conf1, conf2 etc. Then your cores
can reference configName=conf1 and only one copy of
the configuration data will be used rather than re-loading one
for each core as it comes up and down.

Do note that the _first_ query in to one of the not-yet-loaded
cores will be slow. The model here is that you can tolerate
some queries taking more time at first than you might like
in exchange for the hardware savings. This pre-supposes that
you simply cannot fit all the cores into memory at once.

The won't fix bits are there because, as we got farther into this
process, the approach changed and the functionality of the
won't fix JIRAs was subsumed by other changes by and large.

I've got to update that documentation sometime, but just haven't
had time yet. If you go down this route, we'll be happy to
add your name to the authorized editors of the wiki list if you'd
like.

Best
Erick

On Thu, Jun 6, 2013 at 3:08 PM, Aleksey bitterc...@gmail.com wrote:
 I was looking at this wiki and linked issues:
 http://wiki.apache.org/solr/LotsOfCores

 they talk about a limit being 100K cores. Is that per server or per
 entire fleet because zookeeper needs to manage that?

 I was considering a use case where I have tens of millions of indices
 but less that a million needs to be active at any time, so they need
 to be loaded on demand and evicted when not used for a while.
 Also since number one requirement is efficient loading of course I
 assume I will store a prebuilt index somewhere so Solr will just
 download it and strap it in, right?

 The root issue is marked as won;t fix but some other important
 subissues are marked as resolved. What's the overall status of the
 effort?

 Thank you in advance,

 Aleksey


Request to be added to ContributorsGroup

2013-06-06 Thread Josh Lincoln
Hello Wiki Admins,

I have been using Solr for a few years now and I would like to
contribute back by making minor changes and clarifications to the wiki
documentation.

Wiki User Name : JoshLincoln


Thanks


Re: Request to be added to ContributorsGroup

2013-06-06 Thread Erick Erickson
Done, thanks!

On Thu, Jun 6, 2013 at 3:47 PM, Josh Lincoln josh.linc...@gmail.com wrote:
 Hello Wiki Admins,

 I have been using Solr for a few years now and I would like to
 contribute back by making minor changes and clarifications to the wiki
 documentation.

 Wiki User Name : JoshLincoln


 Thanks


Solr 4.1 over Websphere errors

2013-06-06 Thread abillavara

hi all

We are having a problem getting Solr 4.1 to run in WebSphere on Windows 
(Solr 4.3 does not start either).


Websphere version? [8.0.0.3]
Windows version? [Win7 64bit]
Solr version? [4.1]
JDK version? [1.7.0_13 64bit]


Here is the error that none of us have ever seen before.  Can somebody 
please help figure this one out?

Thanks
Anria

---
[6/6/13 11:50:47:102 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader locateSolrHome No /solr/home in 
JNDI
[6/6/13 11:50:47:102 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader locateSolrHome using system 
property solr.solr.home: C:\solr
[6/6/13 11:50:47:122 PDT] 0040 CoreContainer I 
org.apache.solr.core.CoreContainer$Initializer initialize looking for 
solr.xml: C:\solr\solr.xml
[6/6/13 11:50:47:125 PDT] 0040 CoreContainer I 
org.apache.solr.core.CoreContainer init New CoreContainer 397781387
[6/6/13 11:50:47:129 PDT] 0040 CoreContainer I 
org.apache.solr.core.CoreContainer load Loading CoreContainer using Solr 
Home: 'C:\solr\'
[6/6/13 11:50:47:130 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader init new SolrResourceLoader 
for directory: 'C:\solr\'
[6/6/13 11:50:47:134 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader replaceClassLoader Adding 
'file:/C:/solr/lib/commons-beanutils-1.7.0.jar' to classloader
[6/6/13 11:50:47:135 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader replaceClassLoader Adding 
'file:/C:/solr/lib/commons-collections-3.2.1.jar' to classloader
[6/6/13 11:50:47:135 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader replaceClassLoader Adding 
'file:/C:/solr/lib/solr-core-4.1.0.jar' to classloader
[6/6/13 11:50:47:135 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader replaceClassLoader Adding 
'file:/C:/solr/lib/solr-velocity-4.1.0.jar' to classloader
[6/6/13 11:50:47:137 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader replaceClassLoader Adding 
'file:/C:/solr/lib/velocity-1.7.jar' to classloader
[6/6/13 11:50:47:138 PDT] 0040 SolrResourceL I 
org.apache.solr.core.SolrResourceLoader replaceClassLoader Adding 
'file:/C:/solr/lib/velocity-tools-2.0.jar' to classloader
[6/6/13 11:50:47:224 PDT] 0040 SolrDispatchF E 
org.apache.solr.servlet.SolrDispatchFilter init Could not start Solr. 
Check solr/home property and the logs
[6/6/13 11:50:47:327 PDT] 0040 SolrCore  E 
org.apache.solr.common.SolrException log null:java.lang.VerifyError: 
JVMVRFY012 stack shape inconsistent; 
class=org/apache/lucene/codecs/lucene40/Lucene40FieldInfosReader, 
method=read(Lorg/apache/lucene/store/Directory;Ljava/lang/String;Lorg/apache/lucene/store/IOContext;)Lorg/apache/lucene/index/FieldInfos;, 
pc=28

at java.lang.J9VMInternals.verifyImpl(Native Method)
at 
java.lang.J9VMInternals.verify(J9VMInternals.java:90)
at 
java.lang.J9VMInternals.initialize(J9VMInternals.java:167)
at 
org.apache.lucene.codecs.lucene40.Lucene40FieldInfosFormat.init(Lucene40FieldInfosFormat.java:99)
at 
org.apache.lucene.codecs.lucene40.Lucene40Codec.init(Lucene40Codec.java:48)
at java.lang.J9VMInternals.newInstanceImpl(Native 
Method)

at java.lang.Class.newInstance(Class.java:1355)
at 
org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:62)
at 
org.apache.lucene.util.NamedSPILoader.init(NamedSPILoader.java:42)
at 
org.apache.lucene.util.NamedSPILoader.init(NamedSPILoader.java:37)
at 
org.apache.lucene.codecs.Codec.clinit(Codec.java:41)
at java.lang.J9VMInternals.initializeImpl(Native 
Method)
at 
java.lang.J9VMInternals.initialize(J9VMInternals.java:233)
at 
org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:181)
at 
org.apache.solr.core.SolrResourceLoader.init(SolrResourceLoader.java:113)
at 
org.apache.solr.core.SolrResourceLoader.init(SolrResourceLoader.java:229)
at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:421)
at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:404)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:336)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:98)
at 
com.ibm.ws.webcontainer.filter.FilterInstanceWrapper.init(FilterInstanceWrapper.java:142)
at 
com.ibm.ws.webcontainer.filter.WebAppFilterManager._loadFilter(WebAppFilterManager.java:566)
at 
com.ibm.ws.webcontainer.filter.WebAppFilterManager.loadFilter(WebAppFilterManager.java:473)
at 

Re: new xslt

2013-06-06 Thread Upayavira


On Thu, Jun 6, 2013, at 07:54 PM, Christopher Gross wrote:
 In 3.x Solr (and earlier) I was able to create a new xslt doc in the
 conf/xslt directory and immediately start using it.
 
 In my 4.1 setup, I have:
   queryResponseWriter name=xslt class=solr.XSLTResponseWriter
 int name=xsltCacheLifetimeSeconds5/int
   /queryResponseWriter
 
 But after that small wait I still can't use it.  Is there another setting
 that I'm missing somewhere?  I am using SolrCloud, do I need to have
 zookeeper push that change out?

I think you're right - these configs need to be uploaded to ZooKeeper.
You can use the cloud-scripts/zkCli.sh file in the example dir to upload
it.
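Something along these lines (the script ships as example/cloud-scripts/zkcli.sh
in the 4.x distribution; the ZooKeeper address, config directory and config
name below are placeholders):

  cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
      -confdir example/solr/collection1/conf -confname myconf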

Upayavira


nutch 1.4, solr 3.4 configuration error

2013-06-06 Thread Isaac Stennett
I am trying to configure nutch 1.4 with solr 3.4.

I configured everything and when I run the command:

./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2 -topN
2

I get the following error:

java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2013-06-06 15:49:30
SolrDeleteDuplicates: Solr url: http://localhost:8080
Exception in thread main java.io.IOException:
org.apache.solr.client.solrj.SolrServerException: Error executing query
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error
executing query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at
org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
... 9 more
Caused by: org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/select?q=id:[* TO
*]fl=idrows=1wt=javabinversion=2
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
... 11 more


Other possibly helpful information:
1) The solr admin screen comes up fine in the browser.
2) I copied the schema.xml file that came with nutch into my solr core conf
directory
3) Again, nutch will run and crawl everything it's just that when it comes
time to post it to SOLR it throws this error.

I have configured everything I can think of, checked logs, and scoured the
Internet and have not been able to find a solution. If anybody has any
ideas on how I can resolve this I would be incredibly grateful.


Re: nutch 1.4, solr 3.4 configuration error

2013-06-06 Thread bbarani
Can you check if you have the correct SolrJ client library version in both Nutch
and the Solr server?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/nutch-1-4-solr-3-4-configuration-error-tp4068724p4068733.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.1 over Websphere errors

2013-06-06 Thread Shawn Heisey

On 6/6/2013 1:57 PM, abillav...@innoventsolutions.com wrote:

We are having a problem getting Solr4.1 (Solr 4.3 is also not starting)
to run in Websphere on Windows.

Websphere version? [8.0.0.3]
Windows version? [Win7 64bit]
Solr version? [4.1]
JDK version? [1.7.0_13 64bit]


Based on seeing java.lang.J9VMInternals in your log, I am guessing that 
your JVM is IBM's J9, not Oracle.  The Java from IBM is notoriously 
buggy when it comes to running Lucene and Solr.  Try Oracle, version 
1.7.0_21.


If I'm wrong about your JVM, then I have no idea what's wrong.  It looks 
like some kind of fundamental JVM or system problem.


Thanks,
Shawn



Re: Solr 4.1 over Websphere errors

2013-06-06 Thread bbarani
As suggested by Shawn, try to change the JVM; this might resolve your issue.

I had seen this error ':java.lang.VerifyError' before (not specific to SOLR)
when compiling code using JDK 1.7.

After some research I figured out that code compiled for Java 1.7 requires
stack map frame instructions. If you wish to modify Java 1.7 class files,
you need to use ClassWriter.COMPUTE_FRAMES or MethodVisitor.visitFrame().

I was able to solve this issue by using the java option
-XX:-UseSplitVerifier.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-1-over-Websphere-errors-tp4068715p4068735.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.1 over Websphere errors

2013-06-06 Thread Chris Hostetter

: Based on seeing java.lang.J9VMInternals in your log, I am guessing that your
: JVM is IBM's J9, not Oracle.  The Java from IBM is notoriously buggy when it
: comes to running Lucene and Solr.  Try Oracle, version 1.7.0_21.

Note specifically the excellent verbiage here...

http://wiki.apache.org/lucene-java/JavaBugs#IBM_J9_Bugs


-Hoss


Re: Auto-Suggest, spell check dictionary replication to slave issue

2013-06-06 Thread bbarani
Seems like this feature is still yet to be implemented..

https://issues.apache.org/jira/browse/SOLR-866



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-Suggest-spell-check-dictionary-replication-to-slave-issue-tp4068562p4068739.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: nutch 1.4, solr 3.4 configuration error

2013-06-06 Thread Chris Hostetter
: ./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2 -topN
...
: Caused by: org.apache.solr.common.SolrException: Not Found
: 
: Not Found
: 
: request: http://localhost:8080/select?q=id:[* TO
: *]&fl=id&rows=1&wt=javabin&version=2
...
: Other possibly helpful information:
: 1) The solr admin screen comes up fine in the browser.

At which URL does the Solr admin screen come up fine in your browser?

Best guess...

1) you have solr installed such that it uses the webcontext /solr but 
you gave the wrong url to nutch (ie: try -solr 
"http://localhost:8080/solr")

2) you are using multiple collections, and you may need to configure nutch 
to know about which collection you are using (ie: try -solr 
"http://localhost:8080/solr/collection1")

...if neither of those helps, I would suggest you follow up with the 
nutch-user list, as the nutch community is probably in the best position 
to help you configure nutch to work with Solr (and vice versa).


-Hoss


Re: Solr 4.1 over Websphere errors

2013-06-06 Thread Anria
Thank you 

This sure is a lot to chew on



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-1-over-Websphere-errors-tp4068715p4068740.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: OutOfMemory while indexing (PROD environment!)

2013-06-06 Thread Otis Gospodnetic
Hi,

Try running jstat to see if the heap is full. 4GB is not much and could
easily be eaten by structures used for sorting, faceting, and caching.
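
For example (the PID and dump path are placeholders), something like this shows
heap occupancy and GC activity every five seconds, and the heap-dump flags make
the next OOM easier to diagnose:

  # sample GC / heap utilization of the Solr JVM every 5 seconds
  jstat -gcutil <solr_pid> 5000
  # JVM options worth adding so the next OOM leaves evidence behind
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/solr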

Plug: SPM has a new feature that lets you send graphs with various metrics
to the Solr mailing list. I'd personally look at the GC graphs to see if GC
times and counts went up, plus cache graphs to see their utilization.

Otis
Solr  ElasticSearch Support
http://sematext.com/
On Jun 6, 2013 2:26 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi everyone,

 My SolrCloud cluster (4.3.0) has came into production a few days ago.
 Docs are being indexed into Solr using /update requestHandler, as a POST
 request, containing text/xml content-type.

 The collection is sharded into 36 pieces, each shard has two replicas.
 There are 36 nodes (each node on separate virtual machine), so each node
 holds exactly 2 cores.

 Each update request contains 100 docs, what means 2-3 docs for each shard.
 There are 1-2 such requests every minute. Soft-commit happens every 10
 minutes, Hard-commit every 30 minutes, and ramBufferSizeMB=128.

 After 48 hours of zero problems, suddenly one shard went down (its both
 cores). Log says it's OOM (GC overhead limit exceeded). JVM is set to
 Xmx=4G.
 I'm pretty sure that some minutes before this incident, JVM memory wasn't
 so high (even the max memory usage indicator was below 2G).

 Indexing requests did not stop, and started getting HTTP 503 errors (no
 server hosting shard). At this time, some other cores started to go down
 (I had all of the rainbow colors: Active, Recovering, Down, Recovery Failed
 and Gone :).

 Then I tried to restart tomcat of the down nodes, but some of them failed
 to start, due to the error message: "we are not the leader". Only shutting
 down both cores and starting them gradually solved the problem,
 and the whole cluster came back to green state.

 Solr is not yet exposed to users, so no queries have been made at that time
 (but maybe some non-heavy auto-warm queries were executed).

 I don't think that all of the 4GB were being used for justifiable reasons..
 I guess that adding more RAM will not solve the problem, in the long term.

 Where should I start my log investigation? (about the OOM itself, and about
 the chain accident came after it)

 I did a search for previous similar issues. There are a lot, but most of
 them talks about very old versions of Solr.

 [Versions:
 Solr: 4.3.0
 Tomcat 7
 JVM: Oracle 7 (last, standard, JRE), 64bit.
 OS: RedHat 6.3]



Re: LotsOfCores feature

2013-06-06 Thread Erick Erickson
Now Jack, you know it depends <G>. Just answer
the questions: how many simultaneous cores can you
open on your hardware, and what's the maximum percentage
of the cores you expect to be open at any one time?
Do some math and you have your answer.

The meta-data, essentially anything in the core tag
or the core.properties file is kept in an in-memory structure. At
startup time, that structure has to be filled. I haven't measured
exactly, but it's relatively small (GUESS: 256 bytes) plus control
structures. So _theoretically_ you could put millions on a single
node. But you don't want to because:
1) if you're doing core discovery, you have to walk millions of
 directories every time you start up.
2) otherwise you're maintaining a huge solr.xml file (which will be
going away anyway).

Aleksey's use case also calls for less than a million or so open
at once. I can't imagine fitting that many cores into memory
simultaneously on one machine.

The design goal is 10-15K cores on a machine. The theory
is that pretty soon you're going to have a big enough percentage
of them open that you'll blow memory up.

And this is always governed by the size of the transient cache.
Pretty soon you'll be opening a core for each and every query if
you have more requests coming in for unique cores than your
cache size.
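
For reference, the knobs involved look roughly like this in the old-style
solr.xml (the values here are illustrative only, not a recommendation):

  <!-- keep at most 512 transient cores open at once -->
  <cores adminPath="/admin/cores" transientCacheSize="512">
    <!-- a core that is only loaded when a request for it arrives -->
    <core name="user12345" instanceDir="user12345"
          transient="true" loadOnStartup="false"/>
  </cores>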

So, as usual, it's a matter of the usage pattern to determine how
many cores you can put on the machine.

FWIW,
Erick

On Thu, Jun 6, 2013 at 4:13 PM, Jack Krupansky j...@basetechnology.com wrote:
 So, is that a clear yes or a clear no for Aleksey's use case - 10's of
 millions of cores, not all active but each loadable on demand?

 I asked this same basic question months ago and there was no answer
 forthcoming.

 -- Jack Krupansky

 -Original Message- From: Erick Erickson
 Sent: Thursday, June 06, 2013 3:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: LotsOfCores feature


 100K is really not the limit, it's just hard to imagine
 100K cores on a single machine unless some were
 really rarely used. And it's per node, not cluster-wide.

 The current state is that everything is in place, including
 transient cores, auto-discovery, etc. So you should be
 able to go ahead and try it out.

 The next bit that will help with efficiency is sharing named
 config sets. The intent here is that solrhome/configs will
 contain sub-dirs like conf1, conf2 etc. Then your cores
 can reference configName=conf1 and only one copy of
 the configuration data will be used rather than re-loading one
 for each core as it comes up and down.

 Do note that the _first_ query in to one of the not-yet-loaded
 cores will be slow. The model here is that you can tolerate
 some queries taking more time at first than you might like
 in exchange for the hardware savings. This pre-supposes that
 you simply cannot fit all the cores into memory at once.

 The won't fix bits are there because, as we got farther into this
 process, the approach changed and the functionality of the
 won't fix JIRAs was subsumed by other changes by and large.

 I've got to update that documentation sometime, but just haven't
 had time yet. If you go down this route, we'll be happy to
 add your name to the authorized editors of the wiki list if you'd
 like.

 Best
 Erick

 On Thu, Jun 6, 2013 at 3:08 PM, Aleksey bitterc...@gmail.com wrote:

 I was looking at this wiki and linked issues:
 http://wiki.apache.org/solr/LotsOfCores

 they talk about a limit being 100K cores. Is that per server or per
 entire fleet because zookeeper needs to manage that?

 I was considering a use case where I have tens of millions of indices
 but less than a million need to be active at any time, so they need
 to be loaded on demand and evicted when not used for a while.
 Also since number one requirement is efficient loading of course I
 assume I will store a prebuilt index somewhere so Solr will just
 download it and strap it in, right?

 The root issue is marked as won't fix but some other important
 subissues are marked as resolved. What's the overall status of the
 effort?

 Thank you in advance,

 Aleksey




Re: Filtering on results with more than N words.

2013-06-06 Thread Jack Krupansky
From the book, here's an update request processor chain which will count the 
words in the content field and place the count in the content_len_i field. Then 
you could do a range query on that count (see the example query after the results below).


<updateRequestProcessorChain name="regex-count-words">

  <!-- Start with a copy of the content field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">content</str>
    <str name="dest">content_len_i</str>
  </processor>

  <!-- Combine multivalued input into a single string -->
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="delimiter"> </str>
  </processor>

  <!-- Remove hyphens and underscores - join parts into single word -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="pattern">-|_</str>
    <str name="replacement"></str>
  </processor>

  <!-- Reduce words into a single letter X -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="pattern">\w+</str>
    <str name="replacement">X</str>
  </processor>

  <!-- Remove punctuation and white space, leaving just the Xes. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_len_i</str>
    <str name="pattern">[^X]</str>
    <str name="replacement"></str>
  </processor>

  <!-- A count of the Xes is a good proxy for the word count. -->
  <processor class="solr.FieldLengthUpdateProcessorFactory">
    <str name="fieldName">content_len_i</str>
  </processor>

  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Here's a test update using the Solr example schema, assuming you add the 
above URP chain to solrconfig:


curl "http://localhost:8983/solr/update?commit=true&update.chain=regex-count-words" \
-H 'Content-type:application/json' -d '
[{"id": "doc-1", "content": "Hello World"},
{"id": "doc-2", "content": ""},
{"id": "doc-3", "content": " -- --- !"},
{"id": "doc-4", "content": "This is some more."},
{"id": "doc-5", "content": "The CD-ROM, (and num_events_seen.)"},
{"id": "doc-6", "content": "Four score and seven years ago our fathers
   brought forth on this continent a new nation, conceived in liberty,
   and dedicated to the proposition that all men are created equal.
   Now we are engaged in a great civil war, testing whether that nation,
   or any nation so conceived and so dedicated, can long endure. "},
{"id": "doc-7", "content": "401(k)"},
{"id": "doc-8", "content": "[And, this, is the end, of this test.]"}]'

Results:

 "id":"doc-1",
 "content":["Hello World"],
 "content_len_i":2,

 "id":"doc-2",
 "content":[""],
 "content_len_i":0,

 "id":"doc-3",
 "content":[" -- --- !"],
 "content_len_i":0,

 "id":"doc-4",
 "content":["This is some more."],
 "content_len_i":4,

 "id":"doc-5",
 "content":["The CD-ROM, (and num_events_seen.)"],
 "content_len_i":4,

 "id":"doc-6",
 "content":["Four score and seven years ago our fathers\n
 brought forth on this continent a new nation, conceived in liberty,\n
 and dedicated to the proposition that all men are created equal.\n
 Now we are engaged in a great civil war, testing whether that nation,\n
 or any nation so conceived and so dedicated, can long endure. "],
 "content_len_i":54,

 "id":"doc-7",
 "content":["401(k)"],
 "content_len_i":2,

 "id":"doc-8",
 "content":["And, this,
   is the end,
   of this test."],
 "content_len_i":8,
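
With the count in place, the "more than N words" filter is just a range query;
a quick sketch against the example server (the threshold of 5 is arbitrary, and
-g stops curl from globbing the brackets):

curl -g "http://localhost:8983/solr/select?q=*:*&fq=content_len_i:[5+TO+*]"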

-- Jack Krupansky
-Original Message- 
From: Dotan Cohen

Sent: Thursday, June 06, 2013 3:45 AM
To: solr-user@lucene.apache.org
Subject: Filtering on results with more than N words.

Is there any way to restrict the search results to only those
documents with more than N words / tokens in the searched field? I
thought that this would be an easy one to Google for, but I cannot
figure it out or find any references. There are many references to
word size in characters, but not to field size in words.

Thank you.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 



Re: Configuring lucene to suggest the indexed string for all the searches of the substring of the indexed string

2013-06-06 Thread Otis Gospodnetic
Hi

Ngrams *will* do this for you.
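
A minimal sketch of what that could look like (the field type name and gram
sizes are made up, not from any shipped schema): build ngrams of the whole
suggestion at index time, and leave the query side un-grammed so "Bour" or
"ason" matches "Jason Bourne":

<fieldType name="substring_suggest" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>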

Otis
Solr  ElasticSearch Support
http://sematext.com/
On Jun 6, 2013 7:53 AM, Prathik Puthran prathik.puthra...@gmail.com
wrote:

 Basically I want the Suggester to return "Jason Bourne" as a suggestion
 for a ".*Bour.*"-style regex.

 Thanks,
 Prathik


 On Thu, Jun 6, 2013 at 12:52 PM, Prathik Puthran 
 prathik.puthra...@gmail.com wrote:

  This works even now, i.e. when I search for "Jas" it suggests "Jason
  Bourne". What I want is that when I search for "Bour" or "ason" (any
 substring)
  it should suggest "Jason Bourne".
 
 
  On Thu, Jun 6, 2013 at 12:34 PM, Upayavira u...@odoko.co.uk wrote:
 
  Can you se the ShingleFilterFactory? It is ngrams for terms rather than
  characters. If you limited it to two term ngrams, when the user presses
  space after their first word, you could do a suggested query against
  your two term ngram field, which would suggest Jason Bourne, Jason
  Statham, etc then you press space after Jason.
 
  Upayavira
 
  On Thu, Jun 6, 2013, at 07:25 AM, Prathik Puthran wrote:
   My use case is I want to search for any substring of the indexed
 string
   and
   the Suggester should suggest the indexed string. What can I do to make
   this
   work?
  
   Thanks,
   Prathik
  
  
   On Thu, Jun 6, 2013 at 2:05 AM, Mikhail Khludnev
   mkhlud...@griddynamics.com
wrote:
  
Please excuse my misunderstanding, but I always wonder why this
 index
  time
processing is suggested usually. from my POV is the case for
  query-time
processing i.e. PrefixQuery aka wildcard query Jason* .
Ultra-fast term retrieval also provided by TermsComponent.
   
   
On Wed, Jun 5, 2013 at 8:09 PM, Jack Krupansky 
  j...@basetechnology.com
wrote:
   
 ngrams?

 See:
 http://lucene.apache.org/core/**4_3_0/analyzers-common/org/**
 apache/lucene/analysis/ngram/**NGramFilterFactory.html
   
 
 http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html


 -- Jack Krupansky

 -Original Message- From: Prathik Puthran
 Sent: Wednesday, June 05, 2013 11:59 AM
 To: solr-user@lucene.apache.org
 Subject: Configuring lucene to suggest the indexed string for all
  the
 searches of the substring of the indexed string


 Hi,

 Is it possible to configure solr to suggest the indexed string for
  all
the
 searches of the substring of the string?

 Thanks,
 Prathik

   
   
   
--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
   
http://www.griddynamics.com
 mkhlud...@griddynamics.com
   
 
 
 



[blogpost] Memory is overrated, use SSDs

2013-06-06 Thread Toke Eskildsen
Inspired by multiple Solr mailing list entries during the last month or two, I 
did some search performance testing on our 11M documents / 49GB index using 
logged queries on Solr 4 with MMapDirectory. It turns out that our setup with 
Solid State Drives and 8GB of RAM (which leaves 5GB for disk cache) performs 
nearly as well as having the whole index in disk cache; the SSD solution 
delivering ~425 q/s for non-faceted searches and the memory solution delivering 
~475 q/s (roughly estimated from the graphs, sorry). Going full memory cache 
certainly is faster if we ignore warmup, but those last queries/second are 
quite expensive.

http://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/

Regards,
Toke Eskildsen, State and University Library, Denmark

Re: LotsOfCores feature

2013-06-06 Thread Aleksey
I would not try putting tens of millions of cores on one machine. My
question (and I think Jack's as well) was around having them across a
fleet, say if I need 1M then I'd get 100 machines appropriately sized
for 10K each. I was clarifying because there was some talk about
ZooKeeper only being able to store a small amount of configuration and
there were concerns that it won't keep information about which core is
where if it's millions.

This question is still open in my mind, since I haven't yet
familiarized myself with how ZK works.




On Thu, Jun 6, 2013 at 3:23 PM, Erick Erickson erickerick...@gmail.com wrote:
 Now Jack. You know it depends G Just answer
 the questions how many simultaneous cores can you
 open on your hardware, and what's the maximum percentage
 of the cores you expect to be open at any one time.
 Do some math and you have your answer.

 The meta-data, essentially anything in the core tag
 or the core.properties file is kept in an in-memory structure. At
 startup time, that structure has to be filled. I haven't measured
 exactly, but it's relatively small (GUESS: 256 bytes) plus control
 structures. So _theoretically_ you could put millions on a single
 node. But you don't want to because:
 1 if you're doing core discovery, you have to walk millions of
  directories every time you start up.
 2 otherwise you're maintaining a huge solr.xml file (which will be
 going away anyway).

 Aleksey's use case also calls for less than a million or so open
 at once. I can't imagine fitting that many cores into memory
 simultaneously one one machine.

 The design goal is 10-15K cores on a machine. The theory
 is that pretty soon you're going to have a big enough percentage
 of them open that you'll blow memory up.

 And this is always governed by the size of the transient cache.
 Pretty soon you'll be opening a core for each and every query if
 you have more requests coming in for unique cores than your
 cache size.

 So, as usual, it's a matter of the usage pattern to determine how
 many cores you can put on the machine.

 FWIW,
 Erick

 On Thu, Jun 6, 2013 at 4:13 PM, Jack Krupansky j...@basetechnology.com 
 wrote:
 So, is that a clear yes or a clear no for Aleksey's use case - 10's of
 millions of cores, not all active but each loadable on demand?

 I asked this same basic question months ago and there was no answer
 forthcoming.

 -- Jack Krupansky

 -Original Message- From: Erick Erickson
 Sent: Thursday, June 06, 2013 3:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: LotsOfCores feature


 100K is really not the limit, it's just hard to imagine
 100K cores on a single machine unless some were
 really rarely used. And it's per node, not cluster-wide.

 The current state is that everything is in place, including
 transient cores, auto-discovery, etc. So you should be
 able to go ahead and try it out.

 The next bit that will help with efficiency is sharing named
 config sets. The intent here is that solrhome/configs will
 contain sub-dirs like conf1, conf2 etc. Then your cores
 can reference configName=conf1 and only one copy of
 the configuration data will be used rather than re-loading one
 for each core as it comes up and down.

 Do note that the _first_ query in to one of the not-yet-loaded
 cores will be slow. The model here is that you can tolerate
 some queries taking more time at first than you might like
 in exchange for the hardware savings. This pre-supposes that
 you simply cannot fit all the cores into memory at once.

 The won't fix bits are there because, as we got farther into this
 process, the approach changed and the functionality of the
 won't fix JIRAs was subsumed by other changes by and large.

 I've got to update that documentation sometime, but just haven't
 had time yet. If you go down this route, we'll be happy to
 add your name to the authorized editors of the wiki list if you'd
 like.

 Best
 Erick

 On Thu, Jun 6, 2013 at 3:08 PM, Aleksey bitterc...@gmail.com wrote:

 I was looking at this wiki and linked issues:
 http://wiki.apache.org/solr/LotsOfCores

 they talk about a limit being 100K cores. Is that per server or per
 entire fleet because zookeeper needs to manage that?

 I was considering a use case where I have tens of millions of indices
 but less that a million needs to be active at any time, so they need
 to be loaded on demand and evicted when not used for a while.
 Also since number one requirement is efficient loading of course I
 assume I will store a prebuilt index somewhere so Solr will just
 download it and strap it in, right?

 The root issue is marked as won;t fix but some other important
 subissues are marked as resolved. What's the overall status of the
 effort?

 Thank you in advance,

 Aleksey




Re: [blogpost] Memory is overrated, use SSDs

2013-06-06 Thread Shawn Heisey
 Inspired by multiple Solr mailing list entries during the last month or
 two, I did some search performance testing on our 11M documents / 49GB
 index using logged queries on Solr 4 with MMapDirectory. It turns out that
 our setup with Solid State Drives and 8GB of RAM (which leaves 5GB for
 disk cache) performs nearly as well as having the whole index in disk
 cache; the SSD solution delivering ~425 q/s for non-faceted searches and
 the memory solution delivering ~475 q/s (roughly estimated from the
 graphs, sorry). Going full memory cache certainly is faster if we ignore
 warmup, but those last queries/second are quite expensive.

 http://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/

This is awesome! Concrete info is better than speculation.

I think it might be time to split the SSD section of
SolrPerformanceProblems into its own wiki page and expand it.

Have you come across any way yet to have RAID1 with TRIM support,
especially on Linux? RAID10 would be even better.  This is the hurdle that
keeps me from getting serious about SSD.

Thanks,
Shawn




Re: LotsOfCores feature

2013-06-06 Thread Jack Krupansky
I'm glad Erick finally answered my question (I think I actually asked it on 
the original Jira) concerning the rough magnitude of Lots - it's 
hundreds/thousands, but not hundreds of thousands, millions, or tens of 
millions.


So, if an app needs millions, I think that suggests a MegaCores 
capability distinct from LotsOfCores.


A use case would be a web site or service that had millions of users, each of 
whom would have an active Solr core when they are active, but inactive 
otherwise. Of course those cores would not all reside on one node and 
ZooKeeper is out of the question for managing anything that is in the 
millions. This would be a true cloud or data center and even multi-data 
center app, not a cluster app.


So, I imagine that the app's cloud would have ZooKeeper-like servers whose 
job is to know all the available servers in the cloud and what Solr cores 
are running on them and how much spare capacity they have. If a request 
comes in to find a user's Solr, the CloudKeeper would consult its database 
(probably a Solr core with millions of rows!) for the current location 
and status of the user's core. If the core is active, great, its location is 
returned. If not active, CK would check to see if the node on which it 
resides has sufficient spare compute capacity. If so, the user's Solr core 
would be spun up. If not, CK would find a machine with plenty of spare 
capacity, send a request to that node to pull the inactive core from the 
busy machine to the new node (or from a backup store of long idle Solr 
cores). Once the new node has the user's Solr core up, the node notifies CK 
of its status and CK updates its database. Meanwhile, the original client 
request would have returned with an in progress status and the client 
would periodically ping CK to see if progress had completed.


And then there would probably be an idle timeout that would cause a Solr 
core to spin down and notify CK that it is inactive.


Or something like that.

This would be a lot more of a true Solr Cloud than the cluster support 
that we have today.


And the CloudKeeper itself might be a traditional SolrCloud cluster, 
except that it needs to be multi-data center.


-- Jack Krupansky

-Original Message- 
From: Aleksey

Sent: Thursday, June 06, 2013 8:06 PM
To: solr-user
Subject: Re: LotsOfCores feature

I would not try putting tens of millions of cores on one machine. My
question (and I think Jack's as well) was around having them across a
fleet, say if I need 1M then I'd get 100 machines appropriately sized
for 10K each. I was clarifying because there was some talk about
ZooKeeper only being able to store small amount of configuration and
there were concerns that it won't keep information about which core is
where if it's millions.

This question is still open in my mind, since I haven't yet
familiarized myself with how ZK works.




On Thu, Jun 6, 2013 at 3:23 PM, Erick Erickson erickerick...@gmail.com 
wrote:

Now Jack. You know it depends G Just answer
the questions how many simultaneous cores can you
open on your hardware, and what's the maximum percentage
of the cores you expect to be open at any one time.
Do some math and you have your answer.

The meta-data, essentially anything in the core tag
or the core.properties file is kept in an in-memory structure. At
startup time, that structure has to be filled. I haven't measured
exactly, but it's relatively small (GUESS: 256 bytes) plus control
structures. So _theoretically_ you could put millions on a single
node. But you don't want to because:
1 if you're doing core discovery, you have to walk millions of
 directories every time you start up.
2 otherwise you're maintaining a huge solr.xml file (which will be
going away anyway).

Aleksey's use case also calls for less than a million or so open
at once. I can't imagine fitting that many cores into memory
simultaneously one one machine.

The design goal is 10-15K cores on a machine. The theory
is that pretty soon you're going to have a big enough percentage
of them open that you'll blow memory up.

And this is always governed by the size of the transient cache.
Pretty soon you'll be opening a core for each and every query if
you have more requests coming in for unique cores than your
cache size.

So, as usual, it's a matter of the usage pattern to determine how
many cores you can put on the machine.

FWIW,
Erick

On Thu, Jun 6, 2013 at 4:13 PM, Jack Krupansky j...@basetechnology.com 
wrote:

So, is that a clear yes or a clear no for Aleksey's use case - 10's of
millions of cores, not all active but each loadable on demand?

I asked this same basic question months ago and there was no answer
forthcoming.

-- Jack Krupansky

-Original Message- From: Erick Erickson
Sent: Thursday, June 06, 2013 3:53 PM
To: solr-user@lucene.apache.org
Subject: Re: LotsOfCores feature


100K is really not the limit, it's just hard to imagine
100K cores on a single machine unless some were
really 

RE: [blogpost] Memory is overrated, use SSDs

2013-06-06 Thread Toke Eskildsen
Shawn Heisey [s...@elyograg.org]:
 This is awesome! Concrete info is better than speculation.

Thank you.

 I think it might be time to split the SSD section of
 SolrPerformanceProblems into its own wiki page and expand it.

That might be a good idea. It would also be interesting to try and measure a 
mixed read/write environment.

 Have you come across any way yet to have RAID1 with TRIM support,
 especially on Linux? RAID10 would be even better.  This is the hurdle that
 keeps me from getting serious about SSD.

According to Wikipedia there is some support, but I do not know the details
http://en.wikipedia.org/wiki/TRIM#RAID_issues

Searching a bit, it seems that TRIM is supported with lvm, which makes it 
possible to have a single logical volume spanning multiple SSDs. This does not 
have the bulk-speed benefit of a striped RAID-0 setup, but the TRIM-win might 
make up for that.
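
A rough sketch of the two common approaches (the device and mount point are
made up; this assumes ext4 on a kernel/drive combination that supports TRIM):

  mount -o discard /dev/mapper/vg0-solrindex /solr/index   # online discard
  fstrim -v /solr/index                                     # or periodic, batched TRIM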

It would be interesting to see how fast (or slow) SSD performance degenerates 
when updating a Solr index. Having large bulk writes should be quite agreeable 
with SSDs.

Regards,
Toke Eskildsen

HdfsDirectoryFactory

2013-06-06 Thread Jamie Johnson
I've seen reference to an HdfsDirectoryFactory in the new Cloudera Search
along with a commit in the Solr SVN (
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-tlog.xml?view=markup).
Is this something that is being made part of the core?  I've seen
discussions in the past where folks have recommended not using an HDFS-based
DirectoryFactory for reasons like speed; any details/information that
can be provided would be really appreciated.
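
For what it's worth, judging from the Cloudera Search documentation that ships
with the same code, the solrconfig.xml wiring appears to look roughly like the
sketch below; the parameter names are my reading of those docs, not something
verified against the linked trunk revision:

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>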


Re: Can't find solr.xml

2013-06-06 Thread Anria
Nabeel,

I just want to say that, though this post is very old, out of everything on the
internet about this error, your suggestion of moving out of /home/user/solr
and into /opt/solr was the one that worked for me too.

Thank you! 
Anria



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-find-solr-xml-tp3992267p4068768.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
3. Too hard to say from the way you have described it. Show us some sample
input.

Jack,

Here you go.

*Row X*
column1: data here
column2: more data here
...
user_id: 2002

*Row Y*
column1: data here
column2: more data here
...
user_id: 45

*Row Z*
column1: data here
column2: more data here
...
user_id: 45664

So what I plan on doing before inserting into mysql, which is where solr
pulls the data from, is shrinking similar datasets into one row:

*Single Row XYZ*
column1: data here
column2: more data here
...
user_id: 2002 45 45664

Then I would like to have solr parse the user_id as a string.  I just want
to be sure that there won't be any fuzzy searching happening against the
user_id.  That is, 566 shouldn't match anything in the user_id list
above.  It has to return exact results based on user ids.  Also I am
wondering if this will affect performance at all, but I am thinking not
because solr is very fast in general.

Regards,
Nate


Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
I want to query against one user_id in the string.

eg  user_id:2002+AND+created:[${from}+TO+${until}]+data:more

So all of the records with a 2002 in user_id need to be returned and only
those records.  If this can only be guaranteed by having user_id be an
integer, then that is fine, but I would like to reduce the growth of our
table.

*Row X*

 column1: data here
 column2: more data here
 ...
 user_id: 2002

 *Row Y*

 column1: data here
 column2: more data here
 ...
 user_id: 45

 *Row Z*

 column1: data here
 column2: more data here
 ...
 user_id: 45664

 So what I plan on doing before inserting into mysql, which is where solr
 pulls the data from, is shrinking similar datasets into one row:

 *Single Row XYZ*

 column1: data here
 column2: more data here
 ...
 user_id: 2002 45 45664



Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread Jack Krupansky
Okay, now, how about a few queries that you want to use? Do you want to 
query by parts of the user ID, or only by the whole (exact) value?


If the user ID will be a string, fine, but having spaces makes it a little 
more painful to enter in a query - maybe use dashes.


-- Jack Krupansky

-Original Message- 
From: z z

Sent: Thursday, June 06, 2013 11:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new 
email address)


3. Too hard to say from the way you have described it. Show us some sample
input.

Jack,

Here you go.

*Row X*
column1: data here
column2: more data here
...
user_id: 2002

*Row Y*
column1: data here
column2: more data here
...
user_id: 45

*Row Z*
column1: data here
column2: more data here
...
user_id: 45664

So what I plan on doing before inserting into mysql, which is where solr
pulls the data from, is shrinking similar datasets into one row:

*Single Row XYZ*
column1: data here
column2: more data here
...
user_id: 2002 45 45664

Then I would like to have solr parse the user_id as a string.  I just want
to be sure that there wont be any fuzzy searching happening against the
user_id.  That is, 566 shouldn't be a valid value for the user_id list
above.  It has to return exact results based on user ids.  Also I am
wondering if this will affect performance at all, but I am thinking not
because solr is very fast in general.

Regards,
Nate 



Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
eg  user_id:2002+AND+created:[${from}+TO+${until}]+data:more

Expected results:  return row XYZ but ignore this row:

column1: data here
column2: more data here
...
user_id: 45 15001 45664



 *Row X*

 column1: data here
 column2: more data here
 ...
 user_id: 2002

 *Row Y*

 column1: data here
 column2: more data here
 ...
 user_id: 45

 *Row Z*

 column1: data here
 column2: more data here
 ...
 user_id: 45664

 So what I plan on doing before inserting into mysql, which is where solr
 pulls the data from, is shrinking similar datasets into one row:

 *Single Row XYZ*

 column1: data here
 column2: more data here
 ...
 user_id: 2002 45 45664




Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread Jack Krupansky
In that case, you will need to keep two copies of the user ID: one which is 
a single, complete string (string/StrField), and one which is a tokenized text field 
(text/TextField) so that you can do a keyword search against it. Use the string/StrField as 
the main copy and then use a copyField directive in the schema to copy 
from the main copy to the other copy.


So, maybe user_id is the full unique key - you would have to specify, the 
full exact key to query against it, or use wildcards for partial matches, 
and user or user_id_str would be the tokenized text version that would 
allow a simple search by partial value, such as 2002.
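
A minimal schema sketch of that arrangement (the field names and the
text_general type are borrowed from the example schema purely for illustration):

<field name="user_id" type="string" indexed="true" stored="true"/>
<field name="user_id_str" type="text_general" indexed="true" stored="false"/>
<copyField source="user_id" dest="user_id_str"/>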


Even so, I'm still not convinced that you have given us your complete 
requirements. Is the user_id in fact the unique key for the documents?


-- Jack Krupansky

-Original Message- 
From: z z

Sent: Thursday, June 06, 2013 11:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new 
email address)


I want to query against one user_id in the string.

eg  user_id:2002+AND+created:[${from}+TO+${until}]+data:more

So all of the records with a 2002 in user_id need to be returned and only
those records.  If this can only be guaranteed by having user_id be an
integer, then that is fine, but I would like to reduce the growth of our
table.

*Row X*


column1: data here
column2: more data here
...
user_id: 2002

*Row Y*

column1: data here
column2: more data here
...
user_id: 45

*Row Z*

column1: data here
column2: more data here
...
user_id: 45664

So what I plan on doing before inserting into mysql, which is where solr
pulls the data from, is shrinking similar datasets into one row:

*Single Row XYZ*

column1: data here
column2: more data here
...
user_id: 2002 45 45664





Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
The unique key is an auto-incremented int in the db.  Sorry for having
given the impression that user_id is the unique key per document.  This is
a table of events that are happening as users interact with our system.
It just so happens that we were inserting individual records for each user
before we even began to think about using something like Solr.  Now,
however, it seems to me that we should be able to ask questions like give
me all records for user 2002 that have this string value more in data2,
across this time stamp range [  ].  Several simultaneously inserted
rows into the db are exactly the same aside from the user_ids.  I just want
to know beforehand if I can still maintain exact matches for a user if the
user_id becomes a string of concatenated user id values.

From what you are saying it sounds like the user_id_str is really all I
need.  It is tokenized and allows for partial searches.  I just want to
make sure that "2002 15000 45", when tokenized, doesn't allow "20" to
partially match the token "2002".

On Fri, Jun 7, 2013 at 12:57 PM, Jack Krupansky j...@basetechnology.comwrote:

 In that case, you will need to keep two copies of the user ID, one which
 is a single, complete string, and one which is a tokenized field
 text/TextField so that you can do a keyword search against it. Use the
 string/StrField as the main copy and then use a copyField directive in
 the schema to copy from the main copy to the other copy.

 So, maybe user_id is the full unique key - you would have to specify,
 the full exact key to query against it, or use wildcards for partial
 matches, and user or user_id_str would be the tokenized text version
 that would allow a simple search by partial value, such as 2002.

 Even so, I'm still not convinced that you have given us your complete
 requirements. Is the user_id in fact the unique key for the documents?




Re: LotsOfCores feature

2013-06-06 Thread Shawn Heisey
On 6/6/2013 6:32 PM, Jack Krupansky wrote:

big snip

 This would be a lot more of a true Solr Cloud than the cluster
 support that we have today.
 
 And the CloudKeeper itself might be a traditional SolrCloud cluster,
 except that it needs to be multi-data center.

I like a lot of what you said in the huge section that I didn't quote.
It inspired a few ideas.

Recently I was thinking about how we might change the names of certain
things in Solr to get rid of historical throwbacks, given that we are
redefining solr.xml and other config files in the dev branches.  Your
ideas are similar to something I thought about where you'd have an
abstraction higher than a collection, something for which I couldn't
think of a name.

Another idea: One characteristic of SolrCloud is that the master/slave
model goes away.  In some ways, this is a very good thing, but it does
get rid of the ability to index on one set of machines and query on another.

What if we combined master/slave replication with SolrCloud?  What I'm
envisioning here is a master cloud with a low replication factor like 2
or 3, and a slave cloud with a potentially high replication actor.  They
would actually be part of the same cloud, sharing a zookeeper ensemble.
 It would need to support the ability to split configurations, either
with two config sets for one cloud or the ability to include master and
slave configs, similar to how we split index and query analyzers in the
schema.

Related side issue: The fact that SolrCloud uses the standard
replication handler has led to lots of confusion.  People look at the
replication section for their cores and are very confused by what they
see there, and when we tell them that SolrCloud's replicas don't
normally use replication, they get REALLY confused.  How about we set
aside a dedicated handler name (/cloudreplication, perhaps) for an
internally defined replication handler specific for SolrCloud recovery?

Thanks,
Shawn



Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread Jack Krupansky
To be clear, one normally doesn't do queries on portions of an ID - 
usually it is one integrated string.


Further strings are definitely NOT tokenized in Solr.

Your story keeps changing, which is why I have to keep hedging my answers.

At least with your latest story, your user_id should be a text/TextField so 
that it will be tokenized. A query for 2002 will
match on complete tokens, not parts of tokens. If you want to match exactly 
on the full user_id, use a quoted phrase for the full user_id.


But... I still have to hedge, because you refer to a string of concatenated 
user id values. You seem to have two distinct definitions for user id.


So, until you disclose all of your requirements and your data model, 
including a clarification about user id vs. a string of concatenated user 
id values, I can't answer your question definitively, other than Maybe, 
depending on what you really mean by user id.


-- Jack Krupansky

-Original Message- 
From: z z

Sent: Friday, June 07, 2013 12:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new 
email address)


The unique key is an auto-incremented int in the db.  Sorry for having
given the impression that user_id is the unique key per document.  This is
a table of events that are happening as users interact with our system.
It just so happens that we were inserting individual records for each user
before we even began to think about using something like Solr.  Now,
however, it seems to me that we should be able to ask questions like give
me all records for user 2002 that have this string value more in data2,
across this time stamp range [  ].  Several simultaneously inserted
rows into the db are exactly the same aside from the user_ids.  I just want
to know beforehand if I can still maintain exact matches for a user if the
user_id becomes a string of concatenated user id values.


From what you are saying it sounds like the user_id_str is really all I

need.  It is tokenized and allows for partial searches.  I just want to
make sure that 2002 15000 45 when tokenized doesn't allow 20 to
partially match the token 2002.

On Fri, Jun 7, 2013 at 12:57 PM, Jack Krupansky 
j...@basetechnology.comwrote:



In that case, you will need to keep two copies of the user ID, one which
is a single, complete string, and one which is a tokenized field
text/TextField so that you can do a keyword search against it. Use the
string/StrField as the main copy and then use a copyField directive in
the schema to copy from the main copy to the other copy.

So, maybe user_id is the full unique key - you would have to specify,
the full exact key to query against it, or use wildcards for partial
matches, and user or user_id_str would be the tokenized text version
that would allow a simple search by partial value, such as 2002.

Even so, I'm still not convinced that you have given us your complete
requirements. Is the user_id in fact the unique key for the documents?






Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
My language might be a bit off (I am saying string when I probably mean
text in the context of solr), but I'm pretty sure that my story is
unwavering ;)

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` int(11)

So, imagine that we have 1000 entries come in where data above is exactly
the same for all 1000 entries, but user_id is different (id and created
being different is irrelevant).  I am thinking that prior to inserting into
mysql, I should be able to concatenate the user_ids together with
whitespace and then insert them into something like:

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` blob

Then on solr's end it will treat the user_id as Text and parse it (I want
to say tokenize, but maybe my language is incorrect here?).

Then when I search

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

I want to be sure that if I look for user_id 2002, I will get data that
only has a value 2002 in the user_id column and that a separate user with
id 20 cannot accidentally pull data for user_id 2002 as a result of a
fuzzy (my language ok?) match of 20 against (20)02.

Current schema definition:

 <field name="user_id" type="int" indexed="true" stored="true"/>

New schema definition:

<field name="user_id" type="user_id_string" indexed="true" stored="true"/>
...
<fieldType name="user_id_string" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLength="120"/>
  </analyzer>
</fieldType>

I am obviously not a 1337 solr haxor :P

Why do this?  We have a lot of data coming in and I want to compact it as
best as I can.

Regards,
Nate





On Fri, Jun 7, 2013 at 1:23 PM, Jack Krupansky j...@basetechnology.comwrote:

 To be clear, one normally doesn't do queries on portions of an ID -
 usually it is one integrated string.

 Further strings are definitely NOT tokenized in Solr.

 Your story keeps changing, which is why I have to keep hedging my answers.

 At least with your latest store, your user_id should be a text/TextField
 so that it will be tokenized. A query for 2002 will
 match on complete tokens, not parts of tokens. If you want to match
 exactly on the full user_id, use a quoted phrase for the full user_id.

 But... I still have to hedge, because you refer to a string of
 concatenated user id values. You seem to have two distinct definitions for
 user id.

 So, until you disclose all of your requirements and your data model,
 including a clarification about user id vs. a string of concatenated user
 id values, I can't answer your question definitively, other than Maybe,
 depending on what you really mean by user id.


 -- Jack Krupansky

 -Original Message- From: z z
 Sent: Friday, June 07, 2013 12:11 AM

 To: solr-user@lucene.apache.org
 Subject: Re: Schema Change: Int - String (i am the original poster, new
 email address)

 The unique key is an auto-incremented int in the db.  Sorry for having
 given the impression that user_id is the unique key per document.  This is
 a table of events that are happening as users interact with our system.
 It just so happens that we were inserting individual records for each user
 before we even began to think about using something like Solr.  Now,
 however, it seems to me that we should be able to ask questions like give
 me all records for user 2002 that have this string value more in data2,
 across this time stamp range [  ].  Several simultaneously inserted
 rows into the db are exactly the same aside from the user_ids.  I just want
 to know beforehand if I can still maintain exact matches for a user if the
 user_id becomes a string of concatenated user id values.

 From what you are saying it sounds like the user_id_str is really all I
 need.  It is tokenized and allows for partial searches.  I just want to
 make sure that 2002 15000 45 when tokenized doesn't allow 20 to
 partially match the token 2002.

 On Fri, Jun 7, 2013 at 12:57 PM, Jack Krupansky j...@basetechnology.com*
 *wrote:

  In that case, you will need to keep two copies of the user ID, one which
 is a single, complete string, and one which is a tokenized field
 text/TextField so that you can do a keyword search against it. Use the
 string/StrField as the main copy and then use a copyField directive in
 the schema to copy from the main copy to the other copy.

 So, maybe user_id is the full unique key - you would have to specify,
 the full exact key to query against it, or use wildcards for partial
 matches, and user or user_id_str would be the tokenized text version
 that would allow a simple search by partial value, such as 2002.

 Even so, I'm still not convinced that you have given us your complete
 requirements. Is the user_id in fact the unique key for the documents?






Re: [blogpost] Memory is overrated, use SSDs

2013-06-06 Thread Andy
This is very interesting. Thanks for sharing the benchmark.

One question I have is did you precondition the SSD ( 
http://www.sandforce.com/userfiles/file/downloads/FMS2009_F2A_Smith.pdf )? SSD 
performance tends to take a very deep dive once all blocks are written at least 
once and the garbage collector kicks in. 



 From: Toke Eskildsen t...@statsbiblioteket.dk
To: solr-user@lucene.apache.org solr-user@lucene.apache.org 
Sent: Thursday, June 6, 2013 7:11 PM
Subject: [blogpost] Memory is overrated, use SSDs
 

Inspired by multiple Solr mailing list entries during the last month or two, I 
did some search performance testing on our 11M documents / 49GB index using 
logged queries on Solr 4 with MMapDirectory. It turns out that our setup with 
Solid State Drives and 8GB of RAM (which leaves 5GB for disk cache) performs 
nearly as well as having the whole index in disk cache; the SSD solution 
delivering ~425 q/s for non-faceted searches and the memory solution delivering 
~475 q/s (roughly estimated from the graphs, sorry). Going full memory cache 
certainly is faster if we ignore warmup, but those last queries/second are 
quite expensive.

http://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/

Regards,
Toke Eskildsen, State and University Library, Denmark

Re: OR query with null value and non-null value(s)

2013-06-06 Thread Rahul R
Thank you Shawn. This does work. To help me understand better, why do
we need the *:* ? Shouldn't it be implicit ?
Shouldn't
fq=(price:4+OR+(-price:[* TO *]))  //does not work
mean the same as
fq=(price:4+OR+(*:* -price:[* TO *]))   //works

Why does Solr need the *:* there ?




On Fri, Jun 7, 2013 at 12:07 AM, Shawn Heisey s...@elyograg.org wrote:

 On 6/6/2013 12:28 PM, Rahul R wrote:

 I have recently enabled facet.missing=true in solrconfig.xml which gives
 null facet values also. As I understand it, the syntax to do a faceted
 search on a null value is something like this:
 fq=-price:[* TO *]
 So when I want to search on a particular value (for example : 4)  OR null
 value, I would expect the syntax to be something like this:
 fq=(price:4+OR+(-price:[* TO *]))
 But this does not work. After searching around for more, read somewhere
 that the right way to achieve this would be:
 fq=-(-price:4+AND+price:[*+TO+***])
 Now this does work but seems like a very roundabout way. Is there a better
 way to achieve this ?


 Pure negative queries don't work -- you have to have results in the query
 before you can subtract.  For some top-level queries, Solr is able to
 detect this situation and fix it internally, but on inner queries you must
 explicitly state your intentions.  It is best if you always use '*:*
 -query' syntax, just to be safe.

 fq=(price:4+OR+(*:* -price:[* TO *]))

 Thanks,
 Shawn




Re: OR query with null value and non-null value(s)

2013-06-06 Thread Shawn Heisey
On 6/6/2013 11:21 PM, Rahul R wrote:
 Thank you Shawn. This does work. To help me understand better, why do
 we need the *:* ? Shouldn't it be implicit ?
 Shouldn't
 fq=(price:4+OR+(-price:[* TO *]))  //does not work
 mean the same as
 fq=(price:4+OR+(*:* -price:[* TO *]))   //works
 
 Why does Solr need the *:* there ?

When you are excluding with the - (NOT) operator, you can't exclude from
nothing, you have to exclude from something.  The *:* tells Solr to
start with everything, then begin excluding whatever matches the
negative query.

You might ask why it works with your earlier example, which is this:

fq=-price:[* TO *]

When Solr encounters a very simple top level query like this, it is able
to detect the problem and fix it, by adding the *:* behind the scenes.
I'm not sure when this negative query detection was added, but before
that version, those queries didn't ever work.  Unfortunately the
detection doesn't work with more complex queries.  It's one of those
things that a person can do easily but is very hard for a computer.

Thanks,
Shawn