Re: Solr Pagination

2015-10-12 Thread Jan Høydahl
Salman,

You say that you optimized your index from Admin. You should not do that, 
however strange it sounds.
70M docs on 2 shards means 35M docs per shard. What you do when you call 
optimize is to force Lucene
to merge all those 35M docs into ONE SINGLE index segment. You get better HW 
utilization if you let
Lucene/Solr automatically handle merging, meaning you’ll have around 10 smaller 
segments that are faster to
search across than one huge segment.
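
For illustration, the stock solrconfig.xml merge settings look roughly like
this (the values shown are the Lucene defaults, not a tuned recommendation):

  <indexConfig>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <!-- roughly 10 similarly sized segments per tier; Lucene merges as needed -->
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>
  </indexConfig>

Leaving these defaults alone (and not calling optimize) is usually the right
choice.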

Your cache settings are way too high. Remember "size" here is the number of 
*entries*, not the number of bytes.
Start with, say, 100, let the system run for a while under realistic query 
load, and then determine from the cache statistics whether you have a high 
hit rate (the cache is useful) or a high eviction rate (which could indicate 
that you would benefit from an increase).
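
For example, a conservative starting point could look like this in
solrconfig.xml (the numbers are illustrative, not a recommendation for your
data):

  <filterCache class="solr.FastLRUCache"
               size="100"
               initialSize="100"
               autowarmCount="10"/>

Then grow it only if the statistics show a high hit rate together with a
high eviction rate.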

I would not concern myself with high paging offsets unless there is something 
very special about your use case that justifies focusing much energy on it. 
People just don't page beyond page 10 :) and if they do, you should focus on 
improving the relevancy first.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 11 Oct 2015, at 06:54, Shawn Heisey wrote:
> 
> On 10/10/2015 2:55 AM, Salman Ansari wrote:
>> Thanks Shawn for your response. Based on that
>> 1) Can you please direct me where I can get more information about cold
>> shard vs hot shard?
> 
> I don't know of any information out there about hot/cold shards.  I can
> describe it, though:
> 
> A split point is determined.  Everything older than the split point gets
> divided by some method (usually hashing) between multiple cold shards.
> Everything newer than the split point goes into the hot shard.  For my
> index, there is only one hot shard, but it is possible to have multiple
> hot shards.
> 
> On some interval (nightly in my index), the split point is adjusted and
> documents are moved from the hot shard to the cold shards according to
> that split point.  The hot shard is typically a lot smaller than the
> cold shards, which helps increase indexing speed for new documents.
> 
> I am not using SolrCloud. I manage all my own sharding. There is no
> capability included in SolrCloud that can do hot/cold sharding.
> 
>> 2)  That 10GB number assumes there's no other software on the machine, like
>> a database server or a webserver.
>> Yes the machine is dedicated for Solr
>> 
>> 3) How much index data is on the machine?
>> I have 3 collections 2 for testing (so the aggregate of both of them does
>> not exceed 1M document) and the main collection that I am querying now
>> which contains around 69M. I have distributed all my collections into 2
>> shards each with 2 replicas. The consumption on the hard disk is about 40GB.
> 
> That sounds like a recipe for a performance problem, although I am not
> certain why the problem persisted after increasing the memory.  Perhaps
> it has something to do with the filterCache, which I will get to further
> down.
> 
>> 4) A memory size of 14GB would be unusual for a physical machine, and makes 
>> me
>> wonder if you're using virtual machines
>> Yes I am using virtual machine as using a bare metal will be difficult in
>> my case as all of our data center is on the cloud. I can increase its
>> capacity though. While testing some edge cases on Solr, I realized on Solr
>> admin that the memory sometimes reaches to its limit (14GB RAM, and 4GB JVM)
> 
> This is how operating systems and Java are designed to work.  When
> things are running well, all of physical memory might be allocated, and
> the heap will become full on a semi-regular basis.  If it *stays* full,
> that usually means it needs to be larger.  The admin UI is a poor tool
> for watching JVM memory usage.
> 
>> 5) Just to confirm, I have combined the lessons from
>> 
>> http://www.slideshare.net/lucidworks/high-performance-solr-and-jvm-tuning-strategies-used-for-map-quests-search-ahead-darren-spehr
>> AND
>> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>> 
>> to come up with the following settings
>> 
>> FilterCache
>> 
>> <filterCache
>>   size="16384"
>>   initialSize="4096"
>>   autowarmCount="4096"/>
> 
> That's a very very large cache size.  It is likely to use a VERY large
> amount of heap, and autowarming up to 4096 entries at commit time might
> take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
> index core with 70 million documents, each filterCache entry is at least
> 8.75 million bytes.  Multiply that by 16384, and a completely full cache
> would need about 140GB of heap memory.  4096 entries will require 35GB.
> I don't think this cache is actually storing that many entries, or you
> would most certainly be running into OutOfMemoryError exceptions.
> 
>>>   size="16384"
>>   initialSize="16384"
>>  

Re: Indexing logs when using post.jar

2015-10-12 Thread Jan Høydahl
Hi

The answer is no. When you run the tool you are responsible for redirecting its 
output to a file yourself if you want to keep it.
Also, the tool is mostly meant as a quick way to post docs during development 
and testing, not for production.
A tool built for production would need things like robustness checks, retries, 
SolrCloud awareness, multi-threaded feeding etc.,
none of which is present in SimplePostTool.java (post.jar).
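
For production feeding, SolrJ is the usual starting point. A minimal sketch
using CloudSolrClient (the zkHost, collection and field names are
illustrative; batching, retries and error handling are omitted):

  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class Feeder {
    public static void main(String[] args) throws Exception {
      // SolrCloud-aware: routes updates to shard leaders via ZooKeeper
      CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
      client.setDefaultCollection("mycollection");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("title", "an example document");

      client.add(doc);   // in production, send documents in batches
      client.commit();   // or rely on autoCommit instead of explicit commits
      client.close();
    }
  }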

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 12 Oct 2015, at 06:05, Zheng Lin Edwin Yeo wrote:
> 
> Hi,
> 
> I am using Solr 5.3.0, and I would like to find out, is the logs for the
> indexing using post.jar stored anywhere in Solr?
> 
> I would need to know which files has been successfully indexed and which
> has not, so that I can re-run the indexing for those files which has not
> been indexed successfully due to various reasons.
> 
> Thank you.
> 
> Regards,
> Edwin



Re: Indexing logs when using post.jar

2015-10-12 Thread Zheng Lin Edwin Yeo
Hi Jan,

Thank you for your reply.
I've managed to direct the output to a log file.

As for production, which tool will you recommend to be used for indexing?

Regards,
Edwin


On 12 October 2015 at 15:36, Jan Høydahl  wrote:

> Hi
>
> The answer is no. When you run the tool you are responsible to redirect
> its output to file yourself if you want to keep it.
> Also, the tool is mostly meant as a quick way to post docs during
> development and testing, not for production.
> A tool built for production would need things like robustness checks,
> retries, SolrCloud awareness, multi threaded feeding etc,
> neither of which is present in SimplePostTool.java (post.jar).
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 12 Oct 2015, at 06:05, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > I am using Solr 5.3.0, and I would like to find out, is the logs for the
> > indexing using post.jar stored anywhere in Solr?
> >
> > I would need to know which files has been successfully indexed and which
> > has not, so that I can re-run the indexing for those files which has not
> > been indexed successfully due to various reasons.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
>
>


Spell Check and Privacy

2015-10-12 Thread Arnon Yogev
Hi,

Our system supports many users from different organizations and with 
different ACLs. 
We are considering adding a spell check ("did you mean") functionality using 
DirectSolrSpellChecker. However, a privacy concern was raised, as this 
might lead to private information being revealed between users via the 
suggested terms. Using the FileBasedSpellChecker is another option, but 
naturally a static list of terms is not optimal.

Is there a best practice or a suggested method for this kind of case?

Thanks,
Arnon


Re: Using SimpleNaiveBayesClassifier in solr

2015-10-12 Thread Tommaso Teofili
Hi Yewint,

the SNB classifier is not an online one, so you have to retrain it every
time you want to update it.
What you pass to the Classifier is a Reader, so you must ensure that it
remains accessible (i.e. is not closed) for classification to work.
Regarding performance, SNB becomes slower as the number of classes (labels)
increases, since the naive Bayes algorithm scans through all the classes
and chooses the one with the highest probability.
Depending on how big your index is, you might want to make the classifier
use an index that's not accessed by other Lucene / Solr threads, to avoid
impacting such other processes (e.g. indexing / search).

Hope this helps, if you have any further questions just ask.

Regards,
Tommaso



2015-10-10 21:27 GMT+02:00 Yewint Ko :

> Hi
>
> I am trying to use NaiveBayesClassifier in my solr project. Currently
> looking at its test case ClassificationTestBase.java.
>
> Below codes seems like that classifier read the whole index db to train the
> model everytime when classification happened for inputDocument. or am I
> misunderstanding something here? If i had a large index db, will it impact
> performance?
>
> protected void checkCorrectClassification(Classifier<T> classifier,
>     String inputDoc, T expectedResult, Analyzer analyzer,
>     String textFieldName, String classFieldName, Query query) throws Exception {
>   AtomicReader atomicReader = null;
>   try {
>     populateSampleIndex(analyzer);
>     atomicReader = SlowCompositeReaderWrapper.wrap(indexWriter.getReader());
>     classifier.train(atomicReader, textFieldName, classFieldName, analyzer, query);
>     ClassificationResult<T> classificationResult = classifier.assignClass(inputDoc);
>     assertNotNull(classificationResult.getAssignedClass());
>     assertEquals("got an assigned class of " + classificationResult.getAssignedClass(),
>         expectedResult, classificationResult.getAssignedClass());
>     assertTrue("got a not positive score " + classificationResult.getScore(),
>         classificationResult.getScore() > 0);
>   } finally {
>     if (atomicReader != null)
>       atomicReader.close();
>   }
> }
>


Re: Selective field query

2015-10-12 Thread Colin Hunter
Thanks Erick, I'm sure this will be valuable when implementing the ngram
filter factory.

On Fri, Oct 9, 2015 at 4:38 PM, Erick Erickson 
wrote:

> Colin:
>
> Adding debug=all to your query is your friend here, the
> parsed_query.toString will show you exactly what
> is searched against.
>
> Best,
> Erick
>
> On Fri, Oct 9, 2015 at 2:09 AM, Colin Hunter  wrote:
> > Ah ha...   the copy field...  makes sense.
> > Thank You.
> >
> > On Fri, Oct 9, 2015 at 10:04 AM, Upayavira  wrote:
> >
> >>
> >>
> >> On Fri, Oct 9, 2015, at 09:54 AM, Colin Hunter wrote:
> >> > Hi
> >> >
> >> > I am working on a complex search utility with an index created via
> data
> >> > import from an extensive MySQL database.
> >> > There are many ways in which the index is searched. One of the utility
> >> > input fields searches only on a Service Name. However, if I target the
> >> > query as q=ServiceName:"Searched service", this only returns an exact
> >> > string match. If q=Searched Service, the query still returns results
> from
> >> > all indexed data.
> >> >
> >> > Is there a way to construct a query to only return results from one
> field
> >> > of a doc ?
> >> > I have tried setting index=false, stored=true on unwanted fields, but
> >> > these
> >> > appear to have still been returned in results.
> >>
> >> q=ServiceName:(Searched Service)
> >>
> >> That'll look in just one field.
> >>
> >> Remember changing indexed to false doesn't impact the stuff already in
> >> your index. And the reason you are likely getting all that stuff is
> >> because you have a copyField that copies it over into the 'text' field.
> >> If you'll never want to search on some fields, switch them to
> >> index=false, make sure you aren't doing a copyField on them, and then
> >> reindex.
> >>
> >> Upayavira
> >>
> >
> >
> >
> > --
> > www.gfc.uk.net
>



-- 
www.gfc.uk.net


How to formulate query

2015-10-12 Thread Prasanna S. Dhakephalkar
Hi,

 

I am trying to construct a Solr search query that returns results as described
below, but have been unable to do so.

 

I have a search term say "pit"

The result should have (in that order)

 

All docs that have "pit" as first WORD in search field  (pit\ *)+

All docs that have first WORD that starts with "pit"  (pit*\  *)+

All docs that have "pit" as WORD anywhere in search field  (except first)
(*\ pit\ *)+

All docs that have  a WORD starting with "pit" anywhere in search field
(except first) (*\ pit*\ *)+

All docs that have "pit" as string anywhere in the search field except cases
covered above (*pit*)

 

Example :

 

Pit the pat 

Pit digger

Pitch ball

Pitcher man

Dig a pit with shovel

Why do you want to dig a pit with shovel

Cricket pitch is 22 yards

What is pithy, I don't know

Per capita income

Epitome of blah blah

 

 

How can I achieve this ?

 

Regards,

 

Prasanna.

 



RE: NullPointerException

2015-10-12 Thread Duck Geraint (ext) GBJH
"When I use the Admin UI (v5.3.0), and check the spellcheck.build box"
Out of interest, where is this option within the Admin UI? I can't find 
anything like it in mine...

Do you get the same issue by submitting the build command directly with 
something like this instead:
http://localhost:8983/solr//ELspell?spellcheck.build=true
?

It'll be reasonably obvious whether the dictionary has actually been built by the 
file size of your speller store:
/localapps/dev/EventLog/solr/EventLog2/data/spFile


Otherwise, (temporarily) try adding...
true
...to your spellchecker search component config, you might find it'll log a 
more useful error message that way.

Geraint

Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com

-Original Message-
From: Mark Fenbers [mailto:mark.fenb...@noaa.gov]
Sent: 10 October 2015 20:03
To: solr User Group
Subject: NullPointerException

Greetings!

I'm new to Solr Spellchecking...  I have yet to get it to work.

Attached is a snippet from my solrconfig.xml pertaining to my spellcheck 
efforts.

When I use the Admin UI (v5.3.0), and check the spellcheck.build box, I get a 
NullPointerException stacktrace.  The actual stacktrace is at the bottom of the 
attachment.  My spellcheck.q is the following:
Solr will yuse suggestions frum both.

The FileBasedSpellChecker.build method is clearly the problem (determined from 
the stack trace), but I cannot figure out why.

Maybe I don't need to do a build on it...(?)  If I don't, the spell-checker 
finds no misspelled words, yet "yuse" and "frum" are not stand-alone words in 
/usr/share/dict/words.

/usr/share/dict/words exists and has global read permissions.  I displayed the 
file and see no issues (i.e., one word per line) although some "words" are a 
string of digits, but that shouldn't matter.

Does my snippet give any clues about why I would get this error? Is my stripped 
down configuration missing something, perhaps?

Mark






Re: Solr cross core join special condition

2015-10-12 Thread Ali Nazemian
Thank you very much.

Sincerely yours.

On Mon, Oct 12, 2015 at 6:15 AM, Susheel Kumar 
wrote:

> Yes, Ali.  These are targeted for Solr 6 but you have the option download
> source from trunk, build it and try out these features if that helps in the
> meantime.
>
> Thanks
> Susheel
>
> On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian 
> wrote:
>
> > Dear Susheel,
> > Hi,
> >
> > I did check the jira issue that you mentioned but it seems its target is
> > Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not
> found.
> > For Solr 5.x should I try to implement something similar myself?
> >
> > Sincerely yours.
> >
> >
> > On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar 
> > wrote:
> >
> > > You may want to take a look at new Solr feature of Streaming API &
> > > Expressions
> > > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278
> > > for making joins between collections.
> > >
> > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal  wrote:
> > >
> > > > I developed a join transformer plugin that did that (although it
> didn't
> > > > flatten the results like that).  The one thing that was painful about
> > it
> > > is
> > > > that the TextResponseWriter has references to both the IndexSchema
> and
> > > > SolrReturnFields objects for the primary core.  So when you add a
> > > > SolrDocument from another core it returned the wrong fields.  I
> worked
> > > > around that by transforming the SolrDocument to a NamedList.  Then
> when
> > > it
> > > > gets to processing the IndexableFields it uses the wrong
> IndexSchema, I
> > > > worked around that by transforming each field to a hard Java object
> > > > (through the IndexSchema and FieldType of the correct core).  I think
> > it
> > > > would be great to patch TextResponseWriter with multi core writing
> > > > abilities, but there is one question, how can it tell which core a
> > > > SolrDocument or IndexableField is from?  Seems we'd have to add an
> > > > attribute for that.
> > > >
> > > > The other possibly simpler thing to do is execute the join at index
> > time
> > > > with an update processor.
> > > >
> > > > Ryan
> > > >
> > > > On Tuesday, October 6, 2015, Mikhail Khludnev <
> > > mkhlud...@griddynamics.com>
> > > > wrote:
> > > >
> > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian <
> alinazem...@gmail.com
> > > > > > wrote:
> > > > >
> > > > > > it
> > > > > > seems there is not any way to do that right now and it should be
> > > > > developed
> > > > > > somehow. Am I right?
> > > > > >
> > > > >
> > > > > yep
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > > Principal Engineer,
> > > > > Grid Dynamics
> > > > >
> > > > > 
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian


Re: How to use FuzzyQuery in schema.xml

2015-10-12 Thread Upayavira
The fuzzy query does not need mentioning in schema.xml. A search for
Steve~ or Steve~0.5 will trigger a fuzzy query.
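
For example (the field name is illustrative):

  name:Steve~      fuzzy match with the default maximum edit distance of 2
  name:Steve~1     allow at most one edit
  name:Steve~0.5   the older similarity form, still parsed in 4.x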

Upayavira 

On Sat, Oct 10, 2015, at 08:27 PM, vit wrote:
> I am using Solr 4.2
> For some reason I cannot find an example of  FuzzyQuery
> filter in schema.xml.
> Maybe I am in a wrong path ? But all I need is to apply "edit distance"
> similarity in my fuzzy search.
> Please help me figure out. 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-use-FuzzyQuery-in-schema-xml-tp4233900.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Pagination

2015-10-12 Thread Toke Eskildsen
On Mon, 2015-10-12 at 10:05 +0200, Jan Høydahl wrote:
> What you do when you call optimize is to force Lucene to merge all
> those 35M docs into ONE SINGLE index segment. You get better HW
> utilization if you let Lucene/Solr automatically handle merging,
> meaning you’ll have around 10 smaller segments that are faster to
> search across than one huge segment.

As individual Lucene/Solr shard searches are very much single threaded,
the single segment version should be faster. Have you observed
otherwise?


Optimization is a fine feature if one's workflow is batch oriented with
sufficiently long pauses between index updates. Nightly index updates
with few active users at that time could be an example.

- Toke Eskildsen, State and University Library, Denmark




Re: NullPointerException

2015-10-12 Thread Mark Fenbers

On 10/12/2015 5:38 AM, Duck Geraint (ext) GBJH wrote:

"When I use the Admin UI (v5.3.0), and check the spellcheck.build box"
Out of interest, where is this option within the Admin UI? I can't find 
anything like it in mine...
This is in the expanded options that open up once I put a checkmark in 
the "spellcheck" box.

> Do you get the same issue by submitting the build command directly with
> something like this instead:
> http://localhost:8983/solr//ELspell?spellcheck.build=true
> ?

Yes, I do.

> It'll be reasonably obvious whether the dictionary has actually been built by the
> file size of your speller store:
> /localapps/dev/EventLog/solr/EventLog2/data/spFile
>
>
> Otherwise, (temporarily) try adding...
> true
> ...to your spellchecker search component config, you might find it'll log a
> more useful error message that way.
Interesting!  The index builds successfully using this method and I get 
no stacktrace error.  Hurray!  But why??


So now, I tried running a query, so I typed Fenbers into the 
spellcheck.q box, and I get the following 9 suggestions:

fenber
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

I find this very odd because I commented out all references to the 
wordbreak checker in solrconfig.xml.  What do I configure so that Solr 
will give me sensible suggestions like:

  fenders
  embers
  fenberry
and so on?

Mark



Re: Solr cross core join special condition

2015-10-12 Thread Ali Nazemian
Dear Shawn,
Hi,

Since Yonik's Solr blog mentions that this feature is one of the Solr 5.4
features, I assume it will be back-ported to the next stable release (5.4).
Please correct me if this is a wrong assumption.
Thank you very much.

Sincerely yours.

On Mon, Oct 12, 2015 at 12:29 PM, Ali Nazemian 
wrote:

> Thank you very much.
>
> Sincerely yours.
>
> On Mon, Oct 12, 2015 at 6:15 AM, Susheel Kumar 
> wrote:
>
>> Yes, Ali.  These are targeted for Solr 6 but you have the option download
>> source from trunk, build it and try out these features if that helps in
>> the
>> meantime.
>>
>> Thanks
>> Susheel
>>
>> On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian 
>> wrote:
>>
>> > Dear Susheel,
>> > Hi,
>> >
>> > I did check the jira issue that you mentioned but it seems its target is
>> > Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not
>> found.
>> > For Solr 5.x should I try to implement something similar myself?
>> >
>> > Sincerely yours.
>> >
>> >
>> > On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar 
>> > wrote:
>> >
>> > > You may want to take a look at new Solr feature of Streaming API &
>> > > Expressions
>> > > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278
>> > > for making joins between collections.
>> > >
>> > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal  wrote:
>> > >
>> > > > I developed a join transformer plugin that did that (although it
>> didn't
>> > > > flatten the results like that).  The one thing that was painful
>> about
>> > it
>> > > is
>> > > > that the TextResponseWriter has references to both the IndexSchema
>> and
>> > > > SolrReturnFields objects for the primary core.  So when you add a
>> > > > SolrDocument from another core it returned the wrong fields.  I
>> worked
>> > > > around that by transforming the SolrDocument to a NamedList.  Then
>> when
>> > > it
>> > > > gets to processing the IndexableFields it uses the wrong
>> IndexSchema, I
>> > > > worked around that by transforming each field to a hard Java object
>> > > > (through the IndexSchema and FieldType of the correct core).  I
>> think
>> > it
>> > > > would be great to patch TextResponseWriter with multi core writing
>> > > > abilities, but there is one question, how can it tell which core a
>> > > > SolrDocument or IndexableField is from?  Seems we'd have to add an
>> > > > attribute for that.
>> > > >
>> > > > The other possibly simpler thing to do is execute the join at index
>> > time
>> > > > with an update processor.
>> > > >
>> > > > Ryan
>> > > >
>> > > > On Tuesday, October 6, 2015, Mikhail Khludnev <
>> > > mkhlud...@griddynamics.com>
>> > > > wrote:
>> > > >
>> > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian <
>> alinazem...@gmail.com
>> > > > > > wrote:
>> > > > >
>> > > > > > it
>> > > > > > seems there is not any way to do that right now and it should be
>> > > > > developed
>> > > > > > somehow. Am I right?
>> > > > > >
>> > > > >
>> > > > > yep
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Sincerely yours
>> > > > > Mikhail Khludnev
>> > > > > Principal Engineer,
>> > > > > Grid Dynamics
>> > > > >
>> > > > > 
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>> >
>>
>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian


Re: admin-extra

2015-10-12 Thread Upayavira

Do you use it? If so, how?

Upayavira 
On Mon, Oct 12, 2015, at 02:05 AM, Bill Au wrote:
> admin-extra allows one to include additional links and/or information in
> the Solr admin main page:
> 
> https://cwiki.apache.org/confluence/display/solr/Core-Specific+Tools
> 
> Bill
> 
> On Wed, Oct 7, 2015 at 5:40 PM, Upayavira  wrote:
> 
> > Do you use admin-extra within the admin UI?
> >
> > If so, please go to [1] and document your use case. The feature
> > currently isn't implemented in the new admin UI, and without use-cases,
> > it likely won't be - so if you want it in there, please help us
> > understand how you use it!
> >
> > Thanks!
> >
> > Upayavira
> >
> > [1] https://issues.apache.org/jira/browse/SOLR-8140
> >


Re: Using SimpleNaiveBayesClassifier in solr

2015-10-12 Thread Alessandro Benedetti
Hi Yewint,
>
> The sample test code inside seems like that classifier read the whole index
> db to train the model everytime when classification happened for
> inputDocument. or am I misunderstanding something here?


I would suggest you take a look at a couple of articles I wrote last
summer about classification in Lucene and Solr:

http://alexbenedetti.blogspot.co.uk/2015/07/lucene-document-classification.html

http://alexbenedetti.blogspot.co.uk/2015/07/solr-document-classification-part-1.html

Basically the misunderstanding is that this module works as a standard
classifier, which is not the case.
Lucene classification doesn't train a model over time; the index is your
model.
It uses the index data structures to perform the classification processes
(kNN and Simple Bayes are the algorithms I explored at that time).
Basically the algorithms access the term frequencies and document
frequencies stored in the inverted index.

Having a big index will have an impact because we are querying the index,
not because we are building a model.

+1 on all Tommaso's observations!

Cheers



On 10 October 2015 at 20:36, Yewint Ko  wrote:

> Hi
>
> I am trying to use SimpleNaiveBayesClassifier in my solr project. Currently
> looking at its test base ClassificationTestBase.java.
>
> The sample test code inside seems like that classifier read the whole index
> db to train the model everytime when classification happened for
> inputDocument. or am I misunderstanding something here? If i had a large
> index db, will it impact performance?
>
> protected void checkCorrectClassification(Classifier<T> classifier,
>     String inputDoc, T expectedResult, Analyzer analyzer,
>     String textFieldName, String classFieldName, Query query) throws Exception {
>   AtomicReader atomicReader = null;
>   try {
>     populateSampleIndex(analyzer);
>     atomicReader = SlowCompositeReaderWrapper.wrap(indexWriter.getReader());
>     classifier.train(atomicReader, textFieldName, classFieldName, analyzer, query);
>     ClassificationResult<T> classificationResult = classifier.assignClass(inputDoc);
>     assertNotNull(classificationResult.getAssignedClass());
>     assertEquals("got an assigned class of " + classificationResult.getAssignedClass(),
>         expectedResult, classificationResult.getAssignedClass());
>     assertTrue("got a not positive score " + classificationResult.getScore(),
>         classificationResult.getScore() > 0);
>   } finally {
>     if (atomicReader != null)
>       atomicReader.close();
>   }
> }
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: How to use FuzzyQuery in schema.xml

2015-10-12 Thread vit
Thanks Upayavira for the clarification. This works for a one-token query, but
when I try it with multiple tokens like
"Home Builders~" or "Home Builders~0.5" it does not work.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-use-FuzzyQuery-in-schema-xml-tp4233900p4234106.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: catchall fields or multiple fields

2015-10-12 Thread Jack Krupansky
I think it may all depend on the nature of your application and how much
commonality there is between fields.

One interesting area is auto-suggest: while you can certainly suggest from
the union of all fields, you may want to give priority to suggestions from
preferred fields - for example, actual product names or important keywords
rather than random words from the English language that happen to occur in
descriptions, all of which would appear in a catchall.

-- Jack Krupansky

On Mon, Oct 12, 2015 at 8:39 AM, elisabeth benoit  wrote:

> Hello,
>
> We're using solr 4.10 and storing all data in a catchall field. It seems to
> me that one good reason for using a catchall field is when using scoring
> with idf (with idf, a word might not have same score in all fields). We got
> rid of idf and are now considering using multiple fields. I remember
> reading somewhere that using a catchall field might speed up searching
> time. I was wondering if some of you have any opinion (or experience)
> related to this subject.
>
> Best regards,
> Elisabeth
>


Re: Replication and soft commits for NRT searches

2015-10-12 Thread Erick Erickson
First of all, setting soft commit with maxDocs=1 is almost (but not
quite) guaranteed to lead to problems. For _every_ document you add to
Solr, all your top-level caches (i.e. the ones configured in
solrconfig.xml) will be thrown away, all autowarming will be performed,
etc. Essentially, assuming a constant indexing load, none of your
top-level caches are doing you any good.

This might help:
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

By the time an indexing request returns, the document(s) have all been
forwarded to all replicas and indexed to the in-memory structures
_and_ written to the tlog. The next expiration of the soft commit
interval will allow them to be searched, assuming that autowarming is
completed.

I'm going to guess that you'll see a bunch of warnings like
"overlapping ondeck searchers" and you'll be tempted to set
maxWarmingSearchers to some number greater than 2 in solrconfig.xml. I
recommend against this too, that setting is there for a reason.

Do you have any evidence of a problem or is this theoretical?

All that said, I would _strongly_ urge you to revisit the requirement
of having your soft commit maxDocs set to 1.
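
As a sketch, a much less aggressive NRT setup in solrconfig.xml might look
like this (the times are illustrative):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s for durability -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>             <!-- new docs visible within ~5s -->
  </autoSoftCommit>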

Best,
Erick

On Mon, Oct 12, 2015 at 1:01 AM, MOIS Martin (MORPHO)
 wrote:
> Hello,
>
> I am running Solr 5.2.1 in a cluster with 6 nodes. My collections have been 
> created with replicationFactor=2, i.e. I have one replica for each shard. 
> Beyond that I am using autoCommit/maxDocs=1 and autoSoftCommits/maxDocs=1 
> in order to achieve near realtime search behavior.
>
> As far as I understand from section "Write Side Fault Tolerance" in the 
> documentation 
> (https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance),
>  I cannot enforce that an update gets replicated to all replicas, but I can 
> only get the achieved replication factor by requesting the return value rf.
>
> My question is now, what exactly does rf=2 mean? Does it only mean that the 
> replica has written the update to its transaction log? Or has the replica 
> also performed the soft commit as configured with autoSoftCommits/maxDocs=1? 
> The answer is important for me, as if the update would only get written to 
> the transaction log, I could not search for it reliable, as the replica may 
> not have added it to the searchable index.
>
> My second question is, does rf=1 mean that the update was definitely not 
> successful on the replica or could it also represent a timeout of the 
> replication request from the shard leader? If it could also represent a 
> timeout, then there would be a small chance that the replication was 
> successfully despite of the timeout.
>
> Is there a way to retrieve the replication factor for a specific document 
> after the update in order to check if replication was successful in the 
> meantime?
>
> Thanks in advance.
>
> Best Regards,
> Martin Mois


Replication and soft commits for NRT searches

2015-10-12 Thread MOIS Martin (MORPHO)
Hello,

I am running Solr 5.2.1 in a cluster with 6 nodes. My collections have been 
created with replicationFactor=2, i.e. I have one replica for each shard. 
Beyond that I am using autoCommit/maxDocs=1 and autoSoftCommits/maxDocs=1 
in order to achieve near realtime search behavior.

As far as I understand from section "Write Side Fault Tolerance" in the 
documentation 
(https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance),
 I cannot enforce that an update gets replicated to all replicas, but I can 
only get the achieved replication factor by requesting the return value rf.

My question is now, what exactly does rf=2 mean? Does it only mean that the 
replica has written the update to its transaction log? Or has the replica also 
performed the soft commit as configured with autoSoftCommits/maxDocs=1? The 
answer is important for me: if the update were only written to the 
transaction log, I could not search for it reliably, as the replica may not 
have added it to the searchable index.

My second question is, does rf=1 mean that the update was definitely not 
successful on the replica, or could it also represent a timeout of the 
replication request from the shard leader? If it could also represent a 
timeout, then there would be a small chance that the replication was 
successful despite the timeout.

Is there a way to retrieve the replication factor for a specific document after 
the update in order to check if replication was successful in the meantime?

Thanks in advance.

Best Regards,
Martin Mois


Re: No live SolrServers available to handle this request

2015-10-12 Thread Steve
Thanks Mark,

I rebuilt and made sure the versions matched.  It works.
Not sure how that happened tho..

thx.
.strick

On Thu, Oct 8, 2015 at 4:31 PM, Mark Miller  wrote:

> Your Lucene and Solr versions must match.
>
> On Thu, Oct 8, 2015 at 4:02 PM Steve  wrote:
>
> > I've loaded the Films data into a 4 node cluster.  Indexing went well,
> but
> > when I issue a query, I get this:
> >
> > "error": {
> > "msg": "org.apache.solr.client.solrj.SolrServerException: No live
> > SolrServers available to handle this request:
> > [
> >
> >
> http://host-192-168-0-63.openstacklocal:8081/solr/CollectionFilms_shard1_replica2
> > ,
> >
> >
> >
> http://host-192-168-0-62.openstacklocal:8081/solr/CollectionFilms_shard2_replica2
> > ,
> >
> >
> >
> http://host-192-168-0-60.openstacklocal:8081/solr/CollectionFilms_shard2_replica1
> > ]",
> > ...
> >
> > and further down in the stacktrace:
> >
> > Server Error
> > Caused by:
> > java.lang.NoSuchMethodError:
> >
> >
> org.apache.lucene.index.TermsEnum.postings(Lorg/apache/lucene/index/PostingsEnum;I)Lorg/apache/lucene/index/PostingsEnum;\n\tat
> >
> >
> org.apache.solr.search.SolrIndexSearcher.getFirstMatch(SolrIndexSearcher.java:802)\n\tat
> >
> >
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:333)\n\tat
> > ...
> >
> >
> > I'm using:
> >
> > solr version 5.3.1
> >
> > lucene 5.2.1
> >
> > zookeeper version 3.4.6
> >
> > indexing with:
> >
> >cd /opt/solr/example/films;
> >
> > /opt/solr/bin/post -c CollectionFilms -port 8081  films.json
> >
> >
> >
> > thx,
> > .strick
> >
> --
> - Mark
> about.me/markrmiller
>


Re: catchall fields or multiple fields

2015-10-12 Thread Trey Grainger
Elisabeth,

Yes, it will almost always be more efficient to search within a catch-all
field than to search across multiple fields. Think of it this way: when you
search on a single field, you are doing a single keyword search against the
index per term. When you search across multiple fields, you are executing
the search for that term multiple times (once for each field) against the
index, and then doing the necessary intersections/unions/etc. of the
document sets.

As you continue to add more and more fields to search across, the search
continues to grow slower. If you're only searching a few fields then it
will probably not be noticeably slower, but the more and more you add, the
slower your response times will become. This slowdown may be measured in
milliseconds, in which case you may not care, but it will be slower.

The idf point you mentioned can be both a pro and a con depending upon the
use case. For example, if you are searching news content that has a
"french_text" field and an "english_text" field, it would be suboptimal if
for the search "Barack Obama" you got only French documents at the top
because the US president's name is much more commonly found in English
documents. When you're searching fields with different types of content,
however, you might find examples where you'd actually want idf differences
maintained and documents differentiated based upon underlying field.

One particularly nice thing about the multi-field approach is that it is
very easy to apply different boosts to the fields and to dynamically change
the boosts. You can similarly do this with payloads within a catch-all
field. You could even assign each term a payload corresponding to which
field the content came from, and then dynamically change the boosts
associated with those payloads at query time (caveat - custom code
required). See this blog post for an end-to-end payload scoring example,
https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/.


Sharing my personal experience: at CareerBuilder, we use the catch-all
field with payloads (one per underlying field) that we can dynamically
change the weight of at query time. We found that for most of our corpus
sizes (ranging between 2 and 100 million full text jobs or resumes), that
it is more efficient to search between 1 and 3 fields than to do the
multi-field search with payload scoring, but once we get to the 4th field
the extra cost associated with the payload scoring was overtaken by the
additional time required to search each additional field.   These numbers
(3 vs 4 fields, etc.) are all anecdotal, of course, as it is dependent upon
a lot of environmental and corpus factors unique to our use case.

The main point of this approach, however, is that there is no additional
cost per-field beyond the upfront cost to add and score payloads, so we
have been able to easily represent over a hundred of these payload-based
"virtual fields" with different weights within a catch-all field (all with
a fixed query-time cost).

*In summary*: yes, you should expect a performance decline as you add more
and more fields to your query if you are searching across multiple fields.
You can overcome this by using a single catch-all field if you are okay
losing IDF per-field (you'll still have it globally across all fields). If
you want to use a catch-all field, but still want to boost content based
upon the field it originated within, you can accomplish this with payloads.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder


On Mon, Oct 12, 2015 at 9:12 AM, Ahmet Arslan 
wrote:

> Hi,
>
> Catch-all field: No need to worry about how to aggregate scores coming
> from different fields.
> But you cannot utilize different analysers for different fields.
>
> Multiple-fields: You can play with edismax's parameters on-the-fly,
> without having to re-index.
> It is flexible that you can include/exclude fields from search.
>
> Ahmet
>
>
>
> On Monday, October 12, 2015 3:39 PM, elisabeth benoit <
> elisaelisael...@gmail.com> wrote:
> Hello,
>
> We're using solr 4.10 and storing all data in a catchall field. It seems to
> me that one good reason for using a catchall field is when using scoring
> with idf (with idf, a word might not have same score in all fields). We got
> rid of idf and are now considering using multiple fields. I remember
> reading somewhere that using a catchall field might speed up searching
> time. I was wondering if some of you have any opinion (or experience)
> related to this subject.
>
> Best regards,
> Elisabeth
>


Re: How do I set up custom collection cores?

2015-10-12 Thread espeake




From:   Shawn Heisey 
To: solr-user@lucene.apache.org
Date:   10/09/2015 12:33 PM
Subject:Re: How do I set up custom collection cores?



On 10/9/2015 10:03 AM, espe...@oreillyauto.com wrote:
> We are installing Alfresco One 5.0.1 with solr4 on a server that has an
> existing instance of tomcat7.  I am trying to find some better
> documentation on how to setup our cores.  In the solr4.xml located



> Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in
> classpath or '/var/lib/tomcat7/solr/collection1/conf'
> at org.apache.solr.core.SolrResourceLoader.openResource
> (SolrResourceLoader.java:362)
> at org.apache.solr.core.SolrResourceLoader.openConfig
> (SolrResourceLoader.java:308)
> at org.apache.solr.core.Config.(Config.java:116)
> at org.apache.solr.core.Config.(Config.java:86)
> at org.apache.solr.core.SolrConfig.(SolrConfig.java:161)
> at org.apache.solr.core.SolrConfig.readFromResourceLoader
> (SolrConfig.java:144)

Solr can't find the config for the collection1 core.

> When I try to define a docBase in
> the /etc/tomcat7/Catalina/localhost/solr4.xml file catalina.out logs has
> this:

It looks like docBase needs to point to the war file.  If you want to
change where Solr puts its data, you need to define either the
solr.solr.home java system property (on the java commandline --
-Dsolr.solr.home=/my/path) or the solr/home JNDI property.

The other parts of this are at least moving me forward.  I added the
-Dsolr.solr.home=/data/alfresco_5.0.1/alf_data/solr4 to my Java startup and
I still get:

WARNING: A docBase /var/lib/tomcat7/webapps/solr4.war inside the host
appBase has been specified, and will be ignored

Should I point the solr.home to /var/lib/tomcat7/webapps/solr4 ?

https://wiki.apache.org/solr/SolrTomcat#Configuring_Solr_Home_with_JNDI
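
The JNDI variant from that page looks roughly like this (shown here with your
paths filled in, which you should verify):

  <Context docBase="/var/lib/tomcat7/webapps/solr4.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String"
                 value="/data/alfresco_5.0.1/alf_data/solr4" override="true"/>
  </Context>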

Exactly what file/directory layout you need is dependent on the precise
Solr version and whether your solr.xml file (not the solr4.xml you
mentioned -- that's for your container and is not used by solr) is in
the new or old format.  The solr.xml file lives in the solr home directory.

http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond
https://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29

Thanks,
Shawn





catchall fields or multiple fields

2015-10-12 Thread elisabeth benoit
Hello,

We're using solr 4.10 and storing all data in a catchall field. It seems to
me that one good reason for using a catchall field is when using scoring
with idf (with idf, a word might not have same score in all fields). We got
rid of idf and are now considering using multiple fields. I remember
reading somewhere that using a catchall field might speed up searching
time. I was wondering if some of you have any opinion (or experience)
related to this subject.

Best regards,
Elisabeth


RE: Spell Check and Privacy

2015-10-12 Thread Dyer, James
Arnon,

Use "spellcheck.collate=true" with "spellcheck.maxCollationTries" set to a 
non-zero value.  This will give you re-written queries that are guaranteed to 
return hits, given the original query and filters.  If you are using an "mm" 
value other than 100%, you will also want to specify 
"spellcheck.collateParam.mm=100%" (or if using "q.op=OR", then use 
"spellcheck.collateParam.q.op=AND").

Of course, the first section of the spellcheck result will still show every 
possible suggestion, so your client needs to discard these and not divulge them 
to the user.  If you need to know word-by-word how the collations were 
constructed, then specify "spellcheck.collateExtendedResults=true".  Use the 
extended collation results for this information and not the first section of 
the spellcheck results.

This is all fairly well-documented on the old solr wiki:  
https://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
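
Putting the pieces together, a request might look like this (the handler name
and query are illustrative):

  /select?q=restaurnt in boston
    &spellcheck=true
    &spellcheck.collate=true
    &spellcheck.maxCollationTries=5
    &spellcheck.collateParam.mm=100%
    &spellcheck.collateExtendedResults=true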

James Dyer
Ingram Content Group

-Original Message-
From: Arnon Yogev [mailto:arn...@il.ibm.com] 
Sent: Monday, October 12, 2015 2:33 AM
To: solr-user@lucene.apache.org
Subject: Spell Check and Privacy

Hi,

Our system supports many users from different organizations and with 
different ACLs. 
We consider adding a spell check ("did you mean") functionality using 
DirectSolrSpellChecker. However, a privacy concern was raised, as this 
might lead to private information being revealed between users via the 
suggested terms. Using the FileBasedSpellChecker is another option, but 
naturally a static list of terms is not optimal.

Is there a best practice or a suggested method for these kind of cases?

Thanks,
Arnon



Re: Have anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory) ?

2015-10-12 Thread RohanaR
Has this been fixed now so that phrase queries given in double quotes work? I
am trying this and encountered the same problem due to original order of
tokens in the index are not preserved. How can I fix this (if not fixed
yet)?

RohanaR



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4234058.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: catchall fields or multiple fields

2015-10-12 Thread Ahmet Arslan
Hi,

Catch-all field: No need to worry about how to aggregate scores coming from 
different fields.
But you cannot utilize different analysers for different fields.

Multiple-fields: You can play with edismax's parameters on-the-fly, without 
having to re-index.
It is flexible that you can include/exclude fields from search.

Ahmet



On Monday, October 12, 2015 3:39 PM, elisabeth benoit 
 wrote:
Hello,

We're using solr 4.10 and storing all data in a catchall field. It seems to
me that one good reason for using a catchall field is when using scoring
with idf (with idf, a word might not have same score in all fields). We got
rid of idf and are now considering using multiple fields. I remember
reading somewhere that using a catchall field might speed up searching
time. I was wondering if some of you have any opinion (or experience)
related to this subject.

Best regards,
Elisabeth


Fwd: Grouping facets: Possible to get facet results for each Group?

2015-10-12 Thread Peter Sturge
Hello Solr Forum,

Been trying to coerce Group faceting to give some faceting back for each
group, but maybe this use case isn't catered for in Grouping? :

So the Use Case is this:
Let's say I do a grouped search that returns say, 9 distinct groups, and in
these groups are various numbers of unique field values that need faceting
- but the faceting needs to be within each group:


q=user:*&group=true&group.field=user&facet=true&facet.field=host&group.facet=true

This query gives back grouped facets for each 'host' value (i.e. the facet
counts are 'collapsed') - but the facet counts (unique values of 'user'
field) are aggregated for all the groups, not on a 'per-group' basis (i.e.
returned as 'global facets' - outside of the grouped results).
The results from the query above don't say which unique values for
'users' are in which group. If the number of doc hits is very large (it can
easily be in the 100's of thousands) it's not practical to iterate through
the docs looking for unique values.
This Use Case necessitates the unique values within each group, rather than
the total doc hits.

Is this possible with grouping, or inconjunction with another module?

Many thanks,
+Peter


Re: Spell Check and Privacy

2015-10-12 Thread Susheel Kumar
Hi Arnon,

I couldn't fully understand your use case regarding privacy. Are you
concerned that spellcheck may reveal, as part of its suggestions, user names
belonging to different organizations / ACLs, or that after seeing the
suggestions a user may be able to click through and view users from other
organizations?

Please provide some details on your concern for Privacy with Spell Checker.

Thanks,
Susheel

On Mon, Oct 12, 2015 at 9:45 AM, Dyer, James 
wrote:

> Arnon,
>
> Use "spellcheck.collate=true" with "spellcheck.maxCollationTries" set to a
> non-zero value.  This will give you re-written queries that are guaranteed
> to return hits, given the original query and filters.  If you are using an
> "mm" value other than 100%, you also will want specify "
> spellcheck.collateParam.mm=100%". (or if using "q.op=OR", then use
> "spellcheck.collateParam.q.op=AND")
>
> Of course, the first section of the spellcheck result will still show
> every possible suggestion, so your client needs to discard these and not
> divulge them to the user.  If you need to know word-by-word how the
> collations were constructed, then specify
> "spellcheck.collateExtendedResults=true".  Use the extended collation
> results for this information and not the first section of the spellcheck
> results.
>
> This is all fairly well-documented on the old solr wiki:
> https://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
>
> James Dyer
> Ingram Content Group
>
> -Original Message-
> From: Arnon Yogev [mailto:arn...@il.ibm.com]
> Sent: Monday, October 12, 2015 2:33 AM
> To: solr-user@lucene.apache.org
> Subject: Spell Check and Privacy
>
> Hi,
>
> Our system supports many users from different organizations and with
> different ACLs.
> We consider adding a spell check ("did you mean") functionality using
> DirectSolrSpellChecker. However, a privacy concern was raised, as this
> might lead to private information being revealed between users via the
> suggested terms. Using the FileBasedSpellChecker is another option, but
> naturally a static list of terms is not optimal.
>
> Is there a best practice or a suggested method for these kind of cases?
>
> Thanks,
> Arnon
>
>


Re: catchall fields or multiple fields

2015-10-12 Thread Walter Underwood
Why get rid of idf? Most often, idf is a big help in relevance.

I’ve used different weights for different parts of the document, like weighting 
the title 8X the body.

I’ve used different weights for different analysis chains. If we have three 
fields, one lowercased, one stemmed, and one a phonetic representation, then 
you can weight the lower case higher than the stemmed field, and stemmed higher 
than phonetic.
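
With edismax that kind of weighting is just query parameters, e.g. (field
names illustrative):

  defType=edismax&q=barack obama
    &qf=title^8 body_lower^4 body_stemmed^2 body_phonetic

so the weights can be tuned per request without reindexing.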

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 12, 2015, at 6:12 AM, Ahmet Arslan  wrote:
> 
> Hi,
> 
> Catch-all field: No need to worry about how to aggregate scores coming from 
> different fields.
> But you cannot utilize different analysers for different fields.
> 
> Multiple-fields: You can play with edismax's parameters on-the-fly, without 
> having to re-index.
> It is flexible that you can include/exclude fields from search.
> 
> Ahmet
> 
> 
> 
> On Monday, October 12, 2015 3:39 PM, elisabeth benoit 
>  wrote:
> Hello,
> 
> We're using solr 4.10 and storing all data in a catchall field. It seems to
> me that one good reason for using a catchall field is when using scoring
> with idf (with idf, a word might not have same score in all fields). We got
> rid of idf and are now considering using multiple fields. I remember
> reading somewhere that using a catchall field might speed up searching
> time. I was wondering if some of you have any opinion (or experience)
> related to this subject.
> 
> Best regards,
> Elisabeth



Re: How to formulate query

2015-10-12 Thread Erick Erickson
Nothing exists currently that would do this. I would urge you to revisit the
requirements, this kind of super-specific ordering is often not worth the
effort to try to enforce, how does the _user_ benefit here?

Best,
Erick

On Mon, Oct 12, 2015 at 12:47 AM, Prasanna S. Dhakephalkar
 wrote:
> Hi,
>
>
>
> I am trying to make a solr search query to get result as under I am unable
> to get do
>
>
>
> I have a search term say "pit"
>
> The result should have (in that order)
>
>
>
> All docs that have "pit" as first WORD in search field  (pit\ *)+
>
> All docs that have first WORD that starts with "pit"  (pit*\  *)+
>
> All docs that have "pit" as WORD anywhere in search field  (except first)
> (*\ pit\ *)+
>
> All docs that have  a WORD starting with "pit" anywhere in search field
> (except first) (*\ pit*\ *)+
>
> All docs that have "pit" as string anywhere in the search field except cases
> covered above (*pit*)
>
>
>
> Example :
>
>
>
> Pit the pat
>
> Pit digger
>
> Pitch ball
>
> Pitcher man
>
> Dig a pit with shovel
>
> Why do you want to dig a pit with shovel
>
> Cricket pitch is 22 yards
>
> What is pithy, I don't know
>
> Per capita income
>
> Epitome of blah blah
>
>
>
>
>
> How can I achieve this ?
>
>
>
> Regards,
>
>
>
> Prasanna.
>
>
>


Re: How to formulate query

2015-10-12 Thread Mikhail Khludnev
Hello,
Even the number of words can be used as a scoring factor, but that's just a
start. You can cut the first word into a separate field with a _field
mutating update processor_, see
http://lucene.apache.org/solr/5_3_1/solr-core/org/apache/solr/update/processor/FieldMutatingUpdateProcessorFactory.html
Then you can use the query syntax to specify field:, boost^10 and wildcard*.
All such clauses can be combined with spaces as optional clauses.
http://lucene.apache.org/core/5_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html
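
A sketch of such a chain (the processor classes are stock Solr; the chain
name and field names are illustrative):

  <updateRequestProcessorChain name="first-word">
    <!-- copy the whole search field into first_word... -->
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">search_field</str>
      <str name="dest">first_word</str>
    </processor>
    <!-- ...then strip everything after the first whitespace -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">first_word</str>
      <str name="pattern">\s.*$</str>
      <str name="replacement"></str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

A query can then combine optional boosted clauses such as
first_word:pit^100 first_word:pit*^50 search_field:pit.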


On Mon, Oct 12, 2015 at 12:47 AM, Prasanna S. Dhakephalkar <
prasann...@merajob.in> wrote:

> Hi,
>
> I am trying to construct a Solr search query that returns results ordered
> as described below, but I have been unable to do so.
>
> I have a search term, say "pit".
>
> The results should contain, in this order:
>
> All docs that have "pit" as the first WORD in the search field  (pit\ *)+
> All docs whose first WORD starts with "pit"  (pit*\  *)+
> All docs that have "pit" as a WORD anywhere in the search field (except
> first)  (*\ pit\ *)+
> All docs that have a WORD starting with "pit" anywhere in the search field
> (except first)  (*\ pit*\ *)+
> All docs that have "pit" as a string anywhere in the search field, except
> cases covered above  (*pit*)
>
> Example:
>
> Pit the pat
> Pit digger
> Pitch ball
> Pitcher man
> Dig a pit with shovel
> Why do you want to dig a pit with shovel
> Cricket pitch is 22 yards
> What is pithy, I don't know
> Per capita income
> Epitome of blah blah
>
> How can I achieve this?
>
> Regards,
> Prasanna.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





AutoComplete Feature in Solr

2015-10-12 Thread Salman Ansari
Hi,

I have been trying to get the autocomplete feature in Solr working with no
luck up to now. First I read that "suggest component" is the recommended
way as in the below article (and this is the exact functionality I am
looking for, which is to autocomplete multiple words)
http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/

Then I tried implementing suggest as described in the following articles in
this order
1) https://wiki.apache.org/solr/Suggester#SearchHandler_configuration
2) http://solr.pl/en/2010/11/15/solr-and-autocomplete-part-2/  (I
implemented suggesting phrases)
3)
http://stackoverflow.com/questions/18132819/how-to-have-solr-autocomplete-on-whole-phrase-when-query-contains-multiple-terms

With no luck: after implementing each article, when I run my query as

http://[MySolr]:8983/solr/entityStore114/suggest?spellcheck.q=Barack

I get

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
</response>

although I have an entry for Barack Obama in my index. I am posting my
Solr configuration as well:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
    <str name="field">entity_autocomplete</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

It looks like a very simple job, but even after following so many articles,
I could not get it right. Any comment will be appreciated!

Regards,
Salman


EdgeNGramFilterFactory for phrases

2015-10-12 Thread vit
I use Solr 4.2.
I created a field with the following analyzer for both index and search
(the analyzer XML was stripped by the list archive; it tokenizes, applies
KStem, then EdgeNGramFilterFactory).
Maybe KStem is overkill, but I do not think it is important here.
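For reference, an analyzer of the kind described would typically look
something like the sketch below; the tokenizer and gram sizes are assumed,
not taken from the original post:

<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
</fieldType>

Applied at both index and query time, each token is expanded into its
front-anchored prefixes.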

On the phrase search "Peak physical" it returns the result:
"Peak Physical Therapy Physical Therapy Of Brooklyn"

For "Peak Physica" it returns the same result,

BUT for "Pea Physical" it does not return anything. Why?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNGramFilterFactory-for-phrases-tp4234168.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to formulate query

2015-10-12 Thread Susheel Kumar
Hi Prasanna, this is a highly custom relevancy/ordering requirement. One
possible way is to create multiple fields, build a query clause for each of
the searches you list, and boost them accordingly.

Thnx

On Mon, Oct 12, 2015 at 12:50 PM, Erick Erickson wrote:

> Nothing exists currently that would do this. I would urge you to revisit
> the requirements; this kind of super-specific ordering is often not worth
> the effort to enforce. How does the _user_ benefit here?
>
> Best,
> Erick
>
> On Mon, Oct 12, 2015 at 12:47 AM, Prasanna S. Dhakephalkar wrote:
> > Hi,
> >
> > I am trying to construct a Solr search query that returns results
> > ordered as described below, but I have been unable to do so.
> >
> > I have a search term, say "pit".
> >
> > The results should contain, in this order:
> >
> > All docs that have "pit" as the first WORD in the search field  (pit\ *)+
> > All docs whose first WORD starts with "pit"  (pit*\  *)+
> > All docs that have "pit" as a WORD anywhere in the search field (except
> > first)  (*\ pit\ *)+
> > All docs that have a WORD starting with "pit" anywhere in the search
> > field (except first)  (*\ pit*\ *)+
> > All docs that have "pit" as a string anywhere in the search field,
> > except cases covered above  (*pit*)
> >
> > Example:
> >
> > Pit the pat
> > Pit digger
> > Pitch ball
> > Pitcher man
> > Dig a pit with shovel
> > Why do you want to dig a pit with shovel
> > Cricket pitch is 22 yards
> > What is pithy, I don't know
> > Per capita income
> > Epitome of blah blah
> >
> > How can I achieve this?
> >
> > Regards,
> > Prasanna.
>


File-based Spelling

2015-10-12 Thread Mark Fenbers

Greetings!

I'm attempting to use a file-based spell checker.  My sourceLocation is 
/usr/share/dict/linux.words, and my spellcheckIndexDir is set to 
./data/spFile.  BuildOnStartup is set to true, and I see nothing to 
suggest any sort of problem/error in solr.log.  However, in my 
./data/spFile/ directory, there are only two files: segments_2 with only 
71 bytes in it, and a zero-byte write.lock file.  For a source 
dictionary having 480,000 words in it, I was expecting a bit more 
substance in the ./data/spFile directory.  Something doesn't seem right 
with this.


Moreover, I ran a query on the word Fenbers, which isn't listed in the 
linux.words file, but there are several similar words.  The results I 
got back were odd, and suggestions included the following:

fenber
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

But I expected suggestions like fenders, embers, and fenberry, etc. I 
also ran a query on Mark (which IS listed in linux.words) and got back 
two suggestions in a similar format.  I played with configurables like 
changing the fieldType from text_en to string and the characterEncoding 
from UTF-8 to ASCII, etc., but nothing seemed to yield any different 
results.


Can anyone offer suggestions as to what I'm doing wrong?  I've been 
struggling with this for more than 40 hours now!  I'm surprised my 
persistence has lasted this long!


Thanks,
Mark


Re: How do I set up custom collection cores?

2015-10-12 Thread Shawn Heisey
On 10/12/2015 10:31 AM, espe...@oreillyauto.com wrote:
> WARNING: A docBase /var/lib/Tomcat7/webapps/solr4.war inside the host
> appBase has been specified, and will be ignored

That is a Tomcat configuration problem.

I googled to see what I could find.  It sounds to me like you have
specified both appBase and docBase in the Tomcat config that loads Solr,
and that one location is inside the other.  You might need to include
only one of them, or adjust locations so that they are in different places.
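For illustration, the usual fix is to keep the war outside the Host appBase
(webapps) and point a context descriptor at it; the paths here are assumed,
not taken from the original setup:

<!-- conf/Catalina/localhost/solr4.xml -->
<Context docBase="/opt/wars/solr4.war" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/var/solr" override="true"/>
</Context>

Tomcat then deploys the webapp at /solr4 without complaining about a docBase
inside the appBase.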

My experience with Tomcat is very limited.  If what I've said here
doesn't help you figure out the problem, then I would suggest enlisting
the help of the Tomcat project.

http://tomcat.apache.org/findhelp.html

Although there are people on this list that know a lot about Tomcat, as
of 5.0, Solr no longer officially supports deployment in third-party
containers.  With 4.x, we provide the .war file for deployment, but
every container is different, and the Solr project cannot help with
container configuration beyond simple questions about the Jetty that is
included with Solr.

Another option is migrating to Solr 5.x and using the startup scripts
provided in the download.  You would be using the Jetty that is
included, not Tomcat.

Thanks,
Shawn



Re: File-based Spelling

2015-10-12 Thread Erick Erickson
Let's see your solrconfig entries? Doubtless something innocent-seeming
isn't quite right.

This might provide some clues:
http://lucidworks.com/blog/2015/03/04/solr-suggester/

A lot of this functionality has changed in recent years, so the Solr
reference guide is the first place to look:
https://cwiki.apache.org/confluence/display/solr/Spell+Checking
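For comparison, a minimal FileBasedSpellChecker entry matching what Mark
describes might look like this sketch (the fieldType value is assumed):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">/usr/share/dict/linux.words</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="fieldType">text_en</str>
    <str name="spellcheckIndexDir">./data/spFile</str>
    <str name="buildOnStartup">true</str>
  </lst>
</searchComponent>

A handler that includes the component can then be queried with
spellcheck=true&spellcheck.dictionary=file&spellcheck.q=Fenbers.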

Best,
Erick

On Mon, Oct 12, 2015 at 12:37 PM, Mark Fenbers  wrote:
> Greetings!
>
> I'm attempting to use a file-based spell checker.  My sourceLocation is
> /usr/share/dict/linux.words, and my spellcheckIndexDir is set to
> ./data/spFile.  BuildOnStartup is set to true, and I see nothing to suggest
> any sort of problem/error in solr.log.  However, in my ./data/spFile/
> directory, there are only two files: segments_2 with only 71 bytes in it,
> and a zero-byte write.lock file.  For a source dictionary having 480,000
> words in it, I was expecting a bit more substance in the ./data/spFile
> directory.  Something doesn't seem right with this.
>
> Moreover, I ran a query on the word Fenbers, which isn't listed in the
> linux.words file, but there are several similar words.  The results I got
> back were odd, and suggestions included the following:
> fenber
> f en be r
> f e nb er
> f en b er
> f e n be r
> f en b e r
> f e nb e r
> f e n b er
> f e n b e r
>
> But I expected suggestions like fenders, embers, and fenberry, etc. I also
> ran a query on Mark (which IS listed in linux.words) and got back two
> suggestions in a similar format.  I played with configurables like changing
> the fieldType from text_en to string and the characterEncoding from UTF-8 to
> ASCII, etc., but nothing seemed to yield any different results.
>
> Can anyone offer suggestions as to what I'm doing wrong?  I've been
> struggling with this for more than 40 hours now!  I'm surprised my
> persistence has lasted this long!
>
> Thanks,
> Mark


Re: AutoComplete Feature in Solr

2015-10-12 Thread Erick Erickson
Some of the links you're looking at are quite old, and a lot has
changed, assuming you're on a recent Solr version. It's usually best
to look at the Solr reference guide, see:
https://cwiki.apache.org/confluence/display/solr/Suggester

This might also help:
http://lucidworks.com/blog/2015/03/04/solr-suggester/
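For reference, on current versions those links describe solr.SuggestComponent
rather than the older spellcheck-based Suggester. A minimal sketch, with the
suggester name and analyzer fieldType assumed:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">entity_autocomplete</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

The query then becomes /suggest?suggest=true&suggest.q=Barack, after building
the dictionary once with suggest.build=true. AnalyzingInfixLookupFactory
suggests whole stored values from matches anywhere in the phrase, which fits
the multi-word requirement.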

Best,
Erick


On Mon, Oct 12, 2015 at 1:24 PM, Salman Ansari  wrote:
> Hi,
>
> I have been trying to get the autocomplete feature in Solr working with no
> luck up to now. First I read that "suggest component" is the recommended
> way as in the below article (and this is the exact functionality I am
> looking for, which is to autocomplete multiple words)
> http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/
>
> Then I tried implementing suggest as described in the following articles in
> this order
> 1) https://wiki.apache.org/solr/Suggester#SearchHandler_configuration
> 2) http://solr.pl/en/2010/11/15/solr-and-autocomplete-part-2/  (I
> implemented suggesting phrases)
> 3)
> http://stackoverflow.com/questions/18132819/how-to-have-solr-autocomplete-on-whole-phrase-when-query-contains-multiple-terms
>
> With no luck: after implementing each article, when I run my query as
> http://[MySolr]:8983/solr/entityStore114/suggest?spellcheck.q=Barack
>
> I get
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">0</int>
>   </lst>
> </response>
>
> although I have an entry for Barack Obama in my index. I am posting my
> Solr configuration as well:
>
> <searchComponent name="suggest" class="solr.SpellCheckComponent">
>   <lst name="spellchecker">
>     <str name="name">suggest</str>
>     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
>     <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
>     <str name="field">entity_autocomplete</str>
>     <str name="buildOnCommit">true</str>
>   </lst>
> </searchComponent>
>
> <requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
>   <lst name="defaults">
>     <str name="spellcheck">true</str>
>     <str name="spellcheck.dictionary">suggest</str>
>     <str name="spellcheck.count">10</str>
>     <str name="spellcheck.collate">true</str>
>     <str name="spellcheck.onlyMorePopular">false</str>
>   </lst>
>   <arr name="components">
>     <str>suggest</str>
>   </arr>
> </requestHandler>
>
> It looks like a very simple job, but even after following so many articles,
> I could not get it right. Any comment will be appreciated!
>
> Regards,
> Salman


Re: Grouping facets: Possible to get facet results for each Group?

2015-10-12 Thread Alexandre Rafalovitch
Could you use the new nested facets syntax? http://yonik.com/solr-subfacets/
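A sketch of that syntax, with field names assumed; a terms facet buckets the
documents by the grouping field, and the facets you want per group are
nested under it:

json.facet={
  groups: {
    type: terms,
    field: group_field,
    facet: {
      values: { type: terms, field: facet_field }
    }
  }
}

Each bucket of "groups" then carries its own "values" counts.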

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 11 October 2015 at 09:51, Peter Sturge  wrote:
> Been trying to coerce Group faceting to give some faceting back for each
> group, but maybe this use case isn't catered for in Grouping? :
>
> So the Use Case is this:
> Let's say I do a grouped search that returns say, 9 distinct groups, and in
> these groups are various numbers of unique field values that need faceting
> - but the faceting needs to be within each group:


Re: Highlight with NGram and German S Sharp "ß"

2015-10-12 Thread Scott Stults
My guess is that the boundary scanner isn't configured right for your
highlighter. Try setting the bs.language and bs.country parameters either
in your request or in the requestHandler.
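For example, assuming the FastVectorHighlighter with its breakIterator
boundary scanner (the field and locale here are assumptions):

&hl=true&hl.useFastVectorHighlighter=true&hl.fl=textng&hl.bs.type=WORD&hl.bs.language=de&hl.bs.country=DE

The same hl.bs.* values can also be set as defaults on the boundaryScanner
element in solrconfig.xml. Note the FastVectorHighlighter requires
termVectors, termPositions and termOffsets on the highlighted field.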


k/r,
Scott

On Mon, Oct 5, 2015 at 4:57 AM, Jérôme Bernardes  wrote:

> Dear Solr Users,
> I am facing a problem with highlighting on ngram fields.
> Highlighting works well, except for words with the German character "ß".
> E.g., with q=rosen:
> "highlighting": {
> "gcl3r:12723710:6643": {
> "textng": [
> "Rosensteinpark (Métro), Stuttgart (Allemagne)"
> ]
> },
> "gcl3r:2267495:780930": {
> "textng": [
> "Rosenstraße, 94554 Moos (Allemagne)"
> ]
> }
> }
> Without "ß" words are highlight partially Rosensteinpark but
> with "ß", the whole word is highlighted (Rosenstraße)
>
> -
> This characters ß is mapped to "ss" at query and index time (using
>  mapping="mapping-ISOLatin1Accent.txt"/>
>
> )
> .
> Here is the schema.xml for the highlighted field:
>
> <fieldType name="textng" class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;: \-\']"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnNumerics="0"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             splitOnCaseChange="1"
>             preserveOriginal="1"
>             types="wdfftypes.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory" ignoreCase="true" expand="true"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="1"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;: \-\']"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnNumerics="0"
>             generateWordParts="1"
>             generateNumberParts="0"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             splitOnCaseChange="0"
>             preserveOriginal="1"
>             types="wdfftypes.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
>
> Is this a problem in our configuration, or a known bug?
> Regards
> Jérôme
>
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: are there any SolrCloud supervisors?

2015-10-12 Thread Scott Stults
Something like Exhibitor for Zookeeper? Very cool! Don't worry too much
about cleaning up the repo. When it comes time to integrate it with Solr or
make it an Apache top-level project you can start with a fresh commit
history :)


-Scott

On Fri, Oct 2, 2015 at 3:09 PM, r b  wrote:

> I've been working on something that just monitors ZooKeeper to add and
> remove nodes from collections. The use case: I put SolrCloud in an
> autoscaling group on EC2, and as instances go up and down, I need them
> added to the collection. It's something I've built for work and could
> clean up to share on GitHub if there is much interest.
>
> I asked in IRC about a SolrCloud supervisor utility but wanted to extend
> that question to this list. Are there any more "full featured"
> supervisors out there?
>
>
> -renning
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: Selective field query

2015-10-12 Thread Scott Stults
Colin,

The other thing you'll want to keep in mind (and you'll find this out with
debugQuery) is that the query parser is going to take your
ServiceName:(Search Service) and turn it into two queries --
ServiceName:(Search) ServiceName:(Service). That's because the query parser
breaks on whitespace. My bet is you have a lot of entries with a name of "X
Service" and the second part of your query is hitting them. Phrase Field
might be your friend here:

https://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29
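Concretely, assuming the edismax parser and that ServiceName is the field in
play, something like:

q=Search Service&defType=edismax&qf=ServiceName&pf=ServiceName^10

ranks documents where the terms appear together as a phrase above those that
merely contain both words somewhere in the field.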


-Scott

On Mon, Oct 12, 2015 at 4:15 AM, Colin Hunter  wrote:

> Thanks Erick, I'm sure this will be valuable in implementing the ngram
> filter factory.
>
> On Fri, Oct 9, 2015 at 4:38 PM, Erick Erickson wrote:
>
> > Colin:
> >
> > Adding debug=all to your query is your friend here; the
> > parsed_query.toString will show you exactly what
> > is searched against.
> >
> > Best,
> > Erick
> >
> > > On Fri, Oct 9, 2015 at 2:09 AM, Colin Hunter wrote:
> > > Ah ha...   the copy field...  makes sense.
> > > Thank You.
> > >
> > > On Fri, Oct 9, 2015 at 10:04 AM, Upayavira  wrote:
> > >
> > >>
> > >>
> > >> On Fri, Oct 9, 2015, at 09:54 AM, Colin Hunter wrote:
> > >> > Hi
> > >> >
> > >> > I am working on a complex search utility with an index created via
> > data
> > >> > import from an extensive MySQL database.
> > >> > There are many ways in which the index is searched. One of the
> utility
> > >> > input fields searches only on a Service Name. However, if I target
> the
> > >> > query as q=ServiceName:"Searched service", this only returns an
> exact
> > >> > string match. If q=Searched Service, the query still returns results
> > from
> > >> > all indexed data.
> > >> >
> > >> > Is there a way to construct a query to only return results from one
> > field
> > >> > of a doc ?
> > >> > I have tried setting index=false, stored=true on unwanted fields,
> but
> > >> > these
> > >> > appear to have still been returned in results.
> > >>
> > >> q=ServiceName:(Searched Service)
> > >>
> > >> That'll look in just one field.
> > >>
> > >> Remember changing indexed to false doesn't impact the stuff already in
> > >> your index. And the reason you are likely getting all that stuff is
> > >> because you have a copyField that copies it over into the 'text'
> field.
> > >> If you'll never want to search on some fields, switch them to
> > >> index=false, make sure you aren't doing a copyField on them, and then
> > >> reindex.
> > >>
> > >> Upayavira
> > >>
> > >
> > >
> > >
> > > --
> > > www.gfc.uk.net
> >
>
>
>
> --
> www.gfc.uk.net
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: are there any SolrCloud supervisors?

2015-10-12 Thread Trey Grainger
I'd be very interested in taking a look if you post the code.

Trey Grainger
Co-Author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Fri, Oct 2, 2015 at 3:09 PM, r b  wrote:

> I've been working on something that just monitors ZooKeeper to add and
> remove nodes from collections. The use case: I put SolrCloud in an
> autoscaling group on EC2, and as instances go up and down, I need them
> added to the collection. It's something I've built for work and could
> clean up to share on GitHub if there is much interest.
>
> I asked in IRC about a SolrCloud supervisor utility but wanted to extend
> that question to this list. Are there any more "full featured"
> supervisors out there?
>
>
> -renning
>