[Help]Solr_Not_Responding

2015-10-30 Thread Franky Parulian Silalahi
I have a problem with my Solr instance, running on CentOS 7.
Sometimes my Solr is detected as down, but when I check the Solr service,
it is running. How does this happen, and why?


Re: [Help]Solr_Not_Responding

2015-10-30 Thread Modassar Ather
The information given is not sufficient to determine a cause. You can check
the Solr logs for the details of any exceptions.

Regards,
Modassar


On Fri, Oct 30, 2015 at 10:12 AM, Franky Parulian Silalahi <
fra...@telunjuk.com> wrote:

> I have a problem with my Solr instance, running on CentOS 7.
> Sometimes my Solr is detected as down, but when I check the Solr service,
> it is running. How does this happen, and why?
>


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Robert Oschler
Hello Walter and Mikhail,

Thank you for your answers.  Do those spell checkers have the same or
better fuzzy matching capability as Solr/Lucene (Levenshtein, max
distance 2)?  That's a critical requirement for my application.  I take it
from your suggestion of these spell checker apps that they can easily be
extended with a user-defined, supplementary dictionary, yes?

Thanks.

On Fri, Oct 30, 2015 at 3:07 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Perhaps
> FileBasedSpellChecker
> https://cwiki.apache.org/confluence/display/solr/Spell+Checking
>
> On Fri, Oct 30, 2015 at 9:37 PM, Robert Oschler 
> wrote:
>
> > Hello everyone,
> >
> > I have a gigantic list of industry terms that I want to import into a
> > Solr/Lucene instance running on an AWS box.  What is the fastest way to
> > import the list into my Solr/Lucene instance?  I have admin/sudo
> privileges
> > on the box.
> >
> > Also, is there a document that shows me how to set up my Solr/Lucene
> config
> > file to be optimized for fast searches on single word entries using fuzzy
> > search?  I intend to use this Solr/Lucene instance to do spell checking
> on
> > the big industry word list I mentioned above.  Each data record will be a
> > single word from the file.  I'll want to take a single word query and do
> a
> > fuzzy search on the word against the index (Levenshtein, max distance 2
> as
> > per Solr/Lucene's fuzzy search feature).  So what parameters will
> configure
> > Solr/Lucene to be optimized for such a search?  Also, if a document shows
> > the best index/read parameters to support single word fuzzy searching
> then
> > that would be a big help too.  Note, the contents of the index will
> change
> > very infrequently if that affects the optimal parameter mix.
> >
> >
> > --
> > Thanks,
> > Robert Oschler
> > Twitter -> http://twitter.com/roschler
> > http://www.RobotsRule.com/
> > http://www.Robodance.com/
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>



-- 
Thanks,
Robert Oschler
Twitter -> http://twitter.com/roschler
http://www.RobotsRule.com/
http://www.Robodance.com/
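
For reference, a minimal solrconfig.xml sketch of the FileBasedSpellChecker
Mikhail points to, assuming a one-word-per-line dictionary file; the component
name, file name, and index directory below are illustrative, not from this
thread:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">file</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>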


Re: Sort not working as expected

2015-10-30 Thread davidphilip cherian
You can create a copy field with string type and make it copy from the
existing field, and sort on this new one.
That way, you can still continue doing text search on the existing field and
sort on the new one.
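
For reference, a minimal schema.xml sketch of this approach, assuming the
existing searchable field is named "title" ("title_sort" is an illustrative
name, not from the thread):

  <field name="title_sort" type="string" indexed="true" stored="false"/>
  <copyField source="title" dest="title_sort"/>

Queries can then use sort=title_sort asc while still searching against the
original analyzed field.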





On Fri, Oct 30, 2015 at 3:04 PM, Brian Narsi  wrote:

> Is there no way that the existing field can be used?
>
>
> On Fri, Oct 30, 2015 at 1:42 PM, Ray Niu  wrote:
>
> > you should use string type instead of text if you want to sort
> > alphabetically
> >
> > 2015-10-30 11:12 GMT-07:00 Brian Narsi :
> >
> > > I have a fieldtype setup as
> > >
> > > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > >
> > > When I sort on this field type in ascending order I am not getting
> > results
> > > sorted alphabetically as expected.
> > >
> > > Why is that?
> > >
> > > What should I do to get the sort on?
> > >
> > > Thanks
> > >
> >
>


Re: Solr Keyword query on a specific field.

2015-10-30 Thread davidphilip cherian
>> "Is there any way to have a single field search use the same keyword
search logic as the default query?"
Do a phrase search, with double quotes surrounding the multiple keywords;
that should work.

Try q=title:("Test Keywords")

You could also try adding q.op as a local param to the query, as described
here:
https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries

If you are using the edismax query parser, check what the mm param is set
to (q.op=AND => mm=100%; q.op=OR => mm=0%).
https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
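
To make the last two options concrete, two hedged examples assuming a field
named "title" (exact behavior depends on which query parser is configured):

  q={!q.op=AND df=title}Test Keywords                <- local param forces AND for this query
  q=Test Keywords&defType=edismax&qf=title&mm=100%   <- edismax: require all terms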


On Fri, Oct 30, 2015 at 3:27 PM, Aaron Gibbons <
agibb...@synergydatasystems.com> wrote:

> Is there any way to have a single field search use the same keyword search
> logic as the default query? I define q.op as AND in my query which gets
> applied to any main keywords but any keywords I'm trying to use within a
> field do not get the same logic applied.
> Example:
> q=(title:(Test Keywords)) the space is treated as OR regardless of q.op
> q=(Test Keywords) the space is defined by q.op which is AND
>
> Using the correct operators (AND OR * - +...) it works great as I have it
> defined. There's just this one little caveat when you use spaces between
> keywords expecting the q.op operator to be applied.
> Thanks,
> Aaron
>


Re: Using Nutch Segments

2015-10-30 Thread Imtiaz Shakil Siddique
You can check your Solr admin panel. It should be at
http://localhost:8983/solr/

From there, go to your Solr core --> Query.
Inside the query box, type *:*
Solr will then display 10 documents from its index. You can check the
fields and their contents. Solr searches the text field out of the box.

Please look into the Solr Reference Guide
https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
to get started with Solr.
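
For example (the core name here is hypothetical):

  curl "http://localhost:8983/solr/mycore/select?q=*:*&rows=10&wt=json"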

Regards,
Imtiaz Shakil Siddique
On Oct 28, 2015 1:25 AM, "Salonee Rege"  wrote:

>  We have crawled data using Nutch, and now we want to post the Nutch
> segments to Solr to index them. We are following this link:
> https://wiki.apache.org/nutch/bin/nutch%20solrindex. But how do we check
> what to query, since we are directly posting the JSON of the Nutch
> segments to Solr? Kindly help.
>
> Thanks and Regards,
> *Salonee Rege*
> USC Viterbi School of Engineering
> University of Southern California
> Master of Computer Science - Student
> Computer Science - B.E
> salon...@usc.edu  *||* *619-709-6756*
>


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Dedicated spell-checkers have better algorithms than Solr. They usually handle 
transposed characters as well as inserted, deleted, or substituted characters. 
This is an enhanced version of Levenshtein distance. It is called 
Damerau-Levenshtein and is too expensive to use in Solr search. Spell 
correctors can also use a bigger distance than 2, unlike Solr.

The Peter Norvig corrector also handles words that have been run together. The 
Norvig corrector has been translated to many different computer languages.

The Norvig corrector is an interesting approach. It is well worth reading this 
short article to learn more about spelling correction. 

http://norvig.com/spell-correct.html 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
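
To illustrate the difference, here is a self-contained Java sketch (not Solr
code) of the restricted Damerau-Levenshtein distance, i.e. Levenshtein plus
adjacent transpositions:

  public class OsaDistance {
      public static int distance(String a, String b) {
          int[][] d = new int[a.length() + 1][b.length() + 1];
          for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // i deletions
          for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // j insertions
          for (int i = 1; i <= a.length(); i++) {
              for (int j = 1; j <= b.length(); j++) {
                  int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                  d[i][j] = Math.min(Math.min(
                          d[i - 1][j] + 1,            // deletion
                          d[i][j - 1] + 1),           // insertion
                          d[i - 1][j - 1] + cost);    // substitution
                  if (i > 1 && j > 1
                          && a.charAt(i - 1) == b.charAt(j - 2)
                          && a.charAt(i - 2) == b.charAt(j - 1)) {
                      d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                  }
              }
          }
          return d[a.length()][b.length()];
      }

      public static void main(String[] args) {
          // "recieve" -> "receive" is one transposition: this prints 1,
          // where plain Levenshtein distance would be 2.
          System.out.println(distance("recieve", "receive"));
      }
  }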

> On Oct 30, 2015, at 4:45 PM, Robert Oschler  wrote:
> 
> Hello Walter and Mikhail,
> 
> Thank you for your answers.  Do those spell checkers have the same or
> better fuzzy matching capability as Solr/Lucene (Levenshtein, max
> distance 2)?  That's a critical requirement for my application.  I take it
> from your suggestion of these spell checker apps that they can easily be
> extended with a user-defined, supplementary dictionary, yes?
> 
> Thanks.



RE: Question on index time de-duplication

2015-10-30 Thread Markus Jelsma
Hello - keep in mind that both SignatureUpdateProcessorFactory and field 
collapsing do not work in distributed search unless you map identical 
signatures to identical shards.
Markus
 
-Original message-
> From:Scott Stults 
> Sent: Friday 30th October 2015 11:58
> To: solr-user@lucene.apache.org
> Subject: Re: Question on index time de-duplication
> 
> At the top of the De-Duplication wiki page is a note about collapsing
> results. Once you have the signature (identical for each of the duplicates)
> you'll want to collapse your results, keeping the one with max date.
> 
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> 
> 
> k/r,
> Scott
> 
> On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo 
> wrote:
> 
> > Yes, you can try to use the SignatureUpdateProcessorFactory to hash
> > the content into a signature field, and group on the signature field during
> > your search.
> >
> > You can find more information here:
> > https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >
> > I have been using this method to group the index with duplicated content,
> > and it is working fine.
> >
> > Regards,
> > Edwin
> >
> >
> > On 30 October 2015 at 07:20, Shamik Bandopadhyay 
> > wrote:
> >
> > > Hi,
> > >
> > >   I'm looking to customizing index time de-duplication. Here's my use
> > case
> > > and what I'm trying to achieve.
> > >
> > > I have identical documents coming from different release years of a given
> > > product. I need to index them all in Solr, as they are required in
> > > individual year contexts. But there's a generic search which spans across
> > > all the years and hence brings back duplicate/identical content. My goal
> > > is to return only the latest document and filter out the rest. E.g., if
> > > product A has identical documents for 2015, 2014, and 2013, search should
> > > only return 2015 (the latest document) and filter out the rest.
> > >
> > > What I'm thinking (if possible) during index time :
> > >
> > > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > > 2014 content, keeping 2015 (the latest release) untouched. During query
> > > time, I'll add a filter which will exclude contents tagged with "dedup".
> > >
> > > Just wondering if this is achievable by perhaps extending
> > > UpdateRequestProcessorFactory or
> > > customizing SignatureUpdateProcessorFactory ?
> > >
> > > Any pointers will be appreciated.
> > >
> > > Regards,
> > > Shamik
> > >
> >
> 
> 
> 
> -- 
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
> 
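
For reference, a minimal solrconfig.xml sketch of the signature approach
discussed above, close to the De-Duplication wiki example; the chain name,
hashed fields, and signature field are illustrative:

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">title,content</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

The chain is attached to the update handler via update.chain=dedupe, and
query-time collapsing on the signature then looks like
fq={!collapse field=signature}.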


Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-10-30 Thread fabigol
Hi, 
many thanks for your replies.
I understood you both to be saying the same thing. Is that right?
In some records, the id_tiers field is missing?

I have a question: how is it possible that the mapping does not work?

I'm going to check my data (the response to the request). The field
id_fields must not be empty?

But I had tested required=false for the ID, and I had the same error. Is
that logical?

Thank you again for your help. It helps me to better understand Solr.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Document-is-missing-mandatory-uniqueKey-field-id-tp4237067p4237341.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: restore quorum after majority of zk nodes down

2015-10-30 Thread Matteo Grolla
Pushkar... I love this solution
  thanks
I'd just go with 3 zk nodes on each side

2015-10-29 23:46 GMT+01:00 Pushkar Raste :

> How about having, let's say, 4 nodes on each side and making one node in one of
> the data centers an observer. When the data center with the majority of the nodes goes
> down, bounce the observer, reconfiguring it as a voting member.
>
> You will have to revert that node back to being an observer afterwards.
>
> There will be a short outage as far as indexing is concerned but queries
> should continue to work and you don't have to take all the zookeeper nodes
> down.
>
> -- Pushkar Raste
> On Oct 29, 2015 4:33 PM, "Matteo Grolla"  wrote:
>
> > Hi Walter,
> >   it's not a problem to take down zk for a short (1h) time and
> > reconfigure it. Meanwhile, Solr would go into read-only mode.
> > I'd like feedback on the fastest way to do this. Would it work to just
> > reconfigure the cluster with two other empty zk nodes? Would they correctly
> > sync from the non-empty one? Or should I first copy data from zk3 to the two
> > empty nodes?
> > Matteo
> >
> >
> > 2015-10-29 18:34 GMT+01:00 Walter Underwood :
> >
> > > You can't. Zookeeper needs a majority. One node is not a majority of a
> > > three node ensemble.
> > >
> > > There is no way to split a Solr Cloud cluster across two datacenters
> and
> > > have high availability. You can do that with three datacenters.
> > >
> > > You can probably bring up a new Zookeeper ensemble and configure the
> Solr
> > > cluster to talk to it.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > >
> > > > On Oct 29, 2015, at 10:08 AM, Matteo Grolla  >
> > > wrote:
> > > >
> > > > I'm designing a solr cloud installation where nodes from a single
> > cluster
> > > > are distributed on 2 datacenters which are close and very well
> > connected.
> > > let's say that zk nodes zk1 and zk2 are on DC1 and zk3 is on DC2 and
> let's
> > > say
> > > > that DC1 goes down and the cluster is left with zk3.
> > > > how can I restore a zk quorum from this situation?
> > > >
> > > > thanks
> > >
> > >
> >
>


RE: SolrJ stalls/hangs on client.add(); and doesn't return

2015-10-30 Thread Markus Jelsma
Hi - Solr doesn't seem to receive anything, it certainly doesn't log 
anything, and nothing is running out of memory. Indeed, I was clearly 
misunderstanding ConcurrentUpdateSolrClient.

I had hoped, without reading its code, that it would partition its input, which it 
clearly doesn't. I changed the code to partition my own input into batches of up to 
50k documents, and everything is running fine.

Markus
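
A sketch of that client-side partitioning with SolrJ 5.x; the core URL and
batch size are illustrative assumptions, not taken from the thread:

  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchedIndexer {
      public static void index(List<SolrInputDocument> docs) throws Exception {
          try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
              final int batchSize = 1000;  // tune for your documents
              for (int i = 0; i < docs.size(); i += batchSize) {
                  // send one manageable slice at a time instead of the whole list
                  client.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
              }
              client.commit();
          }
      }
  }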

 
 
-Original message-
> From:Erick Erickson 
> Sent: Thursday 29th October 2015 22:28
> To: solr-user 
> Subject: Re: SolrJ stalls/hangs on client.add(); and doesn't return
> 
> You're sending 100K docs in a single packet? It's vaguely possible that you're
> getting a timeout although that doesn't square with no docs being indexed...
> 
> Hmmm, to check you could do a manual commit. Or watch the Solr log to
> see if update
> requests ever go there.
> 
> Or you're running out of memory on the client.
> 
> Or even exceeding the packet size that the servlet container will accept?
> 
> But I think at root you're misunderstanding
> ConcurrentUpdateSolrClient. It doesn't
> partition up a huge array and send them in parallel, it parallelized sending 
> the
> packet each call is given. So it's trying to send all 100K docs at
> once. Probably not
> what you were aiming for.
> 
> Try making batches of 1,000 docs and sending them through instead.
> 
> So the parameters are a bit of magic. You can have up to the number of threads
> you specify sending their entire packet to solr in parallel, and up to 
> queueSize
> requests. Note this is the _request_, not the docs in the list if I'm
> reading the code
> correctly.
> 
> Best,
> Erick
> 
> On Thu, Oct 29, 2015 at 1:52 AM, Markus Jelsma
>  wrote:
> > Hello - we have some processes periodically sending documents to 5.3.0 in 
> > local mode using ConcurrentUpdateSolrClient 5.3.0, it has queueSize 10 and 
> > threadCount 4, just chosen arbitrarily having no idea what is right.
> >
> > Usually its a few thousand up to some tens of thousands of rather small 
> > documents. Now, when the number of documents is around or near a hundred 
> > thousand, client.add(Iterator docIterator) stalls and 
> > never returns. It also doesn't index any of the documents. Upon calling, it 
> > quickly eats CPU and a load of heap but shortly after it goes idle, no CPU 
> > and memory is released.
> >
> > I am puzzled, any ideas to share?
> > Markus
> 


Re: Performance degradation with two collection on same sole instance

2015-10-30 Thread Toke Eskildsen
On Tue, 2015-10-27 at 12:12 -0700, SolrUser1543 wrote:
> The question is, how does Solr manage its resources when it has more than one
> core? Does it need twice the memory? Or might this degradation be a
> coincidence?

There is an overhead for each core, but not much. You should not notice
any performance degradation from the secondary tiny core. An easy way to
check is to remove the extra core and see if that returns performance
back to the previous level. Preferably you should do this without
restarting the Solr instance as doing so affects performance temporarily
until it stabilizes.

- Toke Eskildsen, State and University Library, Denmark




Re: Performance degradation with two collection on same sole instance

2015-10-30 Thread Jan Høydahl
You say you configure a 20GB heap. What is the total physical RAM on the host?
What are your cache sizes for the two collections?
If your cache settings are too high, you may eat too much memory.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
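
For context, caches are configured per core in solrconfig.xml, so each added
collection brings its own set; a typical sketch (sizes purely illustrative):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>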

> 27. okt. 2015 kl. 20.12 skrev SolrUser1543 :
> 
> we have a large Solr cloud, with one collection and no replicas.
> Each machine has one Solr core.
> Recently we decided to add a new collection, based on the same schema, so now
> each Solr instance has two cores.
> The first collection has a very big index, but the new one has only several
> hundred documents.
> 
> The day after we did it, we experienced very strong performance degradation,
> like long query times and server unavailability.
> 
> The JVM was configured with a 20GB heap, and we did not change it during the
> addition of the new collection.
> 
> The question is, how does Solr manage its resources when it has more than one
> core? Does it need twice the memory? Or might this degradation be a
> coincidence?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-degradation-with-two-collection-on-same-sole-instance-tp4236774.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Securing field level access permission by filtering the query itself

2015-10-30 Thread Scott Stults
Douglas,

Managing a per-user-group whitelist of fields outside of Solr seems the
best approach. When the query comes in you can then filter out any fields
not contained in the whitelist before you send the request to Solr. The
easy part will be to do that on URL parameters like fl. Depending on how
your app generates the actual query string, you may want to also scan that
for fielded query clauses (eg "badfield:value") and localParams (eg
"{!dismax qf=badfield}value").

Secondly, you can map internal Solr fields to aliases using this syntax in
the fl parameter: "display_name:real_solr_name". So when the request comes
in from your app, first you'll map from the requested field alias names to
internal Solr names (while enforcing the whitelist), and then in the fl
parameter supply the aliases you want sent in the response.
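
For example, with hypothetical field names, a whitelisted request might end
up as:

  fl=id,product_name:prod_nm_s,price:price_f

so the response shows product_name and price while the internal field names
stay hidden.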


k/r,
Scott

On Wed, Oct 28, 2015 at 6:58 PM, Douglas McGilvray  wrote:

> Hi all,
>
> First I’d like to say the nested facets and the json facet api in
> particular have made my world much better, I thank everyone involved, you
> are all awesome.
>
> My implementation has much of the Solr query building happening in the
> browser; Solr is behind a PHP server which acts as “proxy” and doorman,
> filtering at the document level according to user role and supplying some
> sensible maximums …
>
> However we now wish to filter just one or two potentially sensitive fields
> in one document type according to user role (as determined in the php
> proxy). Duplicating documents (or cores) seems like overkill for just two
> fields in one document type .. I wondered if it would be feasible (in the
> interests of preventing malicious activity) to filter the query itself
> whether it be parameters (fl, facet.fields, terms, etc) … or even deny any
> request in which fieldname occurs …
>
> Is there some way someone might obscure a fieldname in a request?
>
> Kind regards & thanks in advance,
> Douglas




-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-10-30 Thread Mikhail Khludnev
Try to remove the uniqueKey declaration from your schema:
http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field

On Fri, Oct 30, 2015 at 12:51 PM, fabigol 
wrote:

> Hi,
> many thanks for your replies.
> I understood you both to be saying the same thing. Is that right?
> In some records, the id_tiers field is missing?
>
> I have a question: how is it possible that the mapping does not work?
>
> I'm going to check my data (the response to the request). The field
> id_fields must not be empty?
>
> But I had tested required=false for the ID, and I had the same error. Is
> that logical?
>
> Thank you again for your help. It helps me to better understand Solr.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Document-is-missing-mandatory-uniqueKey-field-id-tp4237067p4237341.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Sort not working as expected

2015-10-30 Thread Erick Erickson
bq: Is there no way that the existing field can be used?

In a word, "no". The indexed terms are being used for sorting. You
have a document
that has the title "aardvark zebra". The actual _tokens_ are
aardvark
zebra

Solr/Lucene has no way of knowing whether these should be sorted by "a"
or "z".

Copying the field to a string type and then sorting on that is a good idea,
although you may want to normalize the field (i.e., lowercase it, remove
punctuation, possibly remove stopwords, etc.).

Best,
Erick
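
One hedged way to get that normalization is a single-token analyzed type
instead of a raw string; the type name and the exact filters below are
illustrative:

  <fieldType name="sort_text" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-z0-9 ]" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

KeywordTokenizer keeps the whole title as one token, so "Aardvark Zebra!"
sorts as "aardvark zebra".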

On Fri, Oct 30, 2015 at 3:27 PM, davidphilip cherian
 wrote:
> You can create a copy field with string type and make it copy from the
> existing field, and sort on this new one.
> That way, you can still continue doing text search on the existing field and
> sort on the new one.


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Robert Oschler
Thanks Walter.  Are there any open source spell checkers that implement the
Peter Norvig or Damerau-Levenshtein algorithms?  I'm short on time so I
have to keep the custom coding down to a minimum.


On Fri, Oct 30, 2015 at 8:02 PM, Walter Underwood 
wrote:

> Dedicated spell-checkers have better algorithms than Solr. They usually
> handle transposed characters as well as inserted, deleted, or substituted
> characters. This is an enhanced version of Levenshtein distance. It is
> called Damerau-Levenshtein and is too expensive to use in Solr search.
> Spell correctors can also use a bigger distance than 2, unlike Solr.
>
> The Peter Norvig corrector also handles words that have been run together.
> The Norvig corrector has been translated to many different computer
> languages.
>
> The Norvig corrector is an interesting approach. It is well worth reading
> this short article to learn more about spelling correction.
>
> http://norvig.com/spell-correct.html
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)


-- 
Thanks,
Robert Oschler
Twitter -> http://twitter.com/roschler
http://www.RobotsRule.com/
http://www.Robodance.com/


Re: restore quorum after majority of zk nodes down

2015-10-30 Thread Pushkar Raste
We do need to bounce it, but the outage will be very short, and you don't have
to take down the rest of the ZooKeeper instances.

On 30 October 2015 at 11:00, Daniel Collins  wrote:

> Aren't you asking for dynamic ZK configuration, which isn't supported yet
> (ZOOKEEPER-107, only in 3.5.0-alpha)?  How do you swap a ZooKeeper
> instance from being an observer to a voting member?


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Read the links I have sent.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 30, 2015, at 7:10 PM, Robert Oschler  wrote:
> 
> Thanks Walter.  Are there any open source spell checkers that implement the
> Peter Norvig or Damerau-Levenshtein algorithms?  I'm short on time so I
> have to keep the custom coding down to a minimum.



Re: Problem with the Content Field during Solr Indexing

2015-10-30 Thread Shruti Mundra
Hi Edwin,

The file extension of the image file is ".png", and we are following this
URL for indexing:
"
http://blog.thedigitalgroup.com/vijaym/wp-content/uploads/sites/11/2015/07/SolrImageExtract.png
"

Thanks and Regards,
Shruti Mundra

On Thu, Oct 29, 2015 at 8:33 PM, Zheng Lin Edwin Yeo 
wrote:

> The "\n" actually means new line as decoded by Solr from the indexed
> document.
>
> What is your file extension of your image file, and which method are you
> using to do the indexing?
>
> Regards,
> Edwin
>
>
> On 30 October 2015 at 04:38, Shruti Mundra  wrote:
>
> > Hi,
> >
> > When I try to index an image file directly to Solr, the content
> > attribute consists of trails of "\n"s and not the data.
> > We are successful in getting the metadata for that image.
> >
> > Can anyone help us out on how we could get the content along with the
> > Metadata.
> >
> > Thanks!
> >
> > - Shruti Mundra
> >
>


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Robert Oschler
Thanks Walter.   I believe I have what I need now.  Have a great weekend.

On Fri, Oct 30, 2015 at 11:13 PM, Walter Underwood 
wrote:

> Read the links I have sent.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org


-- 
Thanks,
Robert Oschler
Twitter -> http://twitter.com/roschler
http://www.RobotsRule.com/
http://www.Robodance.com/


Re: How to get values of external file field(s) in Solr query?

2015-10-30 Thread chitrapatel
I have implemented ExternalFileField in Solr, but I am not able to write
values into the external file "external_BestSellerTest" that is located in the
~\data directory inside the Solr core.

My application and the Solr server are physically separated in two places.
The application calculates a score and generates a file for the
ExternalFileField, so I need to post this external file into the Solr
directory. How do I do this?

Please help me write values into that external file.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-values-of-external-file-field-s-in-Solr-query-tp4073467p4237367.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ stalls/hangs on client.add(); and doesn't return

2015-10-30 Thread Susheel Kumar
Just a suggestion, Markus: sending 50k documents worked in your case, but
you may want to benchmark sending batches of 5k, 10k, or 20k and
compare them with 50k batches.  It may turn out that a smaller batch size
is faster than a very big one...

On Fri, Oct 30, 2015 at 7:59 AM, Markus Jelsma 
wrote:

> Hi - Solr doesn't seem to receive anything, it certainly doesn't log
> anything, and nothing is running out of memory. Indeed, I was clearly
> misunderstanding ConcurrentUpdateSolrClient.
>
> I had hoped, without reading its code, that it would partition its input,
> which it clearly doesn't. I changed the code to partition my own input into
> batches of up to 50k documents, and everything is running fine.
>
> Markus


Re: SolrJ stalls/hangs on client.add(); and doesn't return

2015-10-30 Thread Yonik Seeley
On Thu, Oct 29, 2015 at 5:28 PM, Erick Erickson  wrote:
> Try making batches of 1,000 docs and sending them through instead.

The other thing about ConcurrentUpdateSolrClient is that it will
create batches itself while streaming.
For example, if you call add a number of times very quickly, those
will all be put in the same update request as they are being streamed
(you get the benefits of batching without the latency it would
normally come with).

So I guess I'd advise not batching yourself unless it makes more sense
for your document processing for other reasons.

-Yonik
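
A sketch of that usage pattern with SolrJ 5.x; the URL is illustrative, and
queueSize=10 / threadCount=4 mirror the arbitrary values from the original
post:

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class StreamingIndexer {
      public static void index(Iterable<SolrInputDocument> docs) throws Exception {
          try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                  "http://localhost:8983/solr/mycore", 10, 4)) {
              for (SolrInputDocument doc : docs) {
                  client.add(doc);          // returns quickly; background threads batch and stream
              }
              client.blockUntilFinished();  // wait for queued updates to reach Solr
              client.commit();
          }
      }
  }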


Re: SolrJ stalls/hangs on client.add(); and doesn't return

2015-10-30 Thread Erick Erickson
Glad you could solve it one way or the other. I do wonder, though, what's
really going on; the fact that your original case just hung is kind of
disturbing.

50K is still a lot, and Yonik's comment is well taken. I did some benchmarking
(not ConcurrentUpdateSolrServer, HttpSolrClient as I remember) and got
diminishing returns pretty rapidly after the first few, see:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

There was a huge jump going from 1 to 10, a smaller (percentage-wise)
jump going from 10 to 100, and not much to talk about between 100 and 1,000
(single threaded, YMMV of course).

Best
Erick


On Fri, Oct 30, 2015 at 6:26 AM, Yonik Seeley  wrote:
> On Thu, Oct 29, 2015 at 5:28 PM, Erick Erickson  
> wrote:
>> Try making batches of 1,000 docs and sending them through instead.
>
> The other thing about ConcurrentUpdateSolrClient is that it will
> create batches itself while streaming.
> For example, if you call add a number of  times very quickly, those
> will all be put in the same update request as they are being streamed
> (you get the benefits of batching without the latency it would
> normally come with.)
>
> So I guess I'd advise to not batch yourself unless it makes more sense
> for your document processing for other reasons.
>
> -Yonik


Re: restore quorum after majority of zk nodes down

2015-10-30 Thread Daniel Collins
Aren't you asking for dynamic ZK configuration, which isn't supported yet
(ZOOKEEPER-107, only in 3.5.0-alpha)?  How do you swap a ZooKeeper
instance from being an observer to a voting member?

On 30 October 2015 at 09:34, Matteo Grolla  wrote:

> Pushkar... I love this solution
>   thanks
> I'd just go with 3 zk nodes on each side


Re: Securing field level access permission by filtering the query itself

2015-10-30 Thread Douglas McGilvray

Scott, thanks for the reply. I like the idea of mapping all the fieldnames 
internally, adding security through obscurity. My question therefore would be: 
what is the definitive list of query parameters that one must filter to ensure 
a particular field is not exposed in the query response? Am I missing any in 
the following?

fl
facet.field
facet.pivot
json.facet
terms.fl


kr
Douglas


> On 30 Oct 2015, at 07:37, Scott Stults  
> wrote:
> 
> Douglas,
> 
> Managing a per-user-group whitelist of fields outside of Solr seems the
> best approach. When the query comes in you can then filter out any fields
> not contained in the whitelist before you send the request to Solr. The
> easy part will be to do that on URL parameters like fl. Depending on how
> your app generates the actual query string, you may want to also scan that
> for fielded query clauses (eg "badfield:value") and localParams (eg
> "{!dismax qf=badfield}value").
> 
> Secondly, you can map internal Solr fields to aliases using this syntax in
> the fl parameter: "display_name:real_solr_name". So when the request comes
> in from your app, first you'll map from the requested field alias names to
> internal Solr names (while enforcing the whitelist), and then in the fl
> parameter supply the aliases you want sent in the response.
> 
> 
> k/r,
> Scott



growth of tlog

2015-10-30 Thread Rallavagu

4.10.4 solr cloud, 3 zk quorum, jdk 8

autocommit: 15 sec, softcommit: 2 min

Under heavy indexing load with the above settings, I have seen the tlog grow 
(into the GBs). After the updates stop coming in, it settles down, but it 
takes a while to recover before the cloud becomes "green".


With a 15-second autocommit setting, what could potentially cause the tlog to 
grow? What should I look for?
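
For reference, those settings correspond to a solrconfig.xml block like the
following sketch (openSearcher=false is an assumption; the post doesn't say):

  <autoCommit>
    <maxTime>15000</maxTime>           <!-- 15 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>120000</maxTime>          <!-- 2 minutes -->
  </autoSoftCommit>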


Re: growth of tlog

2015-10-30 Thread Rallavagu



On 10/30/15 8:39 AM, Erick Erickson wrote:

I infer that this statement: "takes a while to recover before cloud
becomes green"
indicates that the node is in recovery or something while indexing. If you're
still indexing, the new documents will be written to the followers
tlog while the
follower is recovering, leading to it growing. I expect that after followers
all recover, the tlog shrinks after a few commits have gone by.


Correct. The recovery time is extended, though. Also, this affects 
available physical memory as the tlog continues to grow and it is memory-mapped.




If that's all true, the question is why the follower goes into
recovery in the first
place. Prior to 5.2, there was a situation in which very heavy indexing
could cause a follower to go into Leader Initiated Recovery (LIR) (look for this
in both the leader and follower logs). Here's the blog Tim Potter wrote
on this subject:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/

The smoking gun here is
1> heavy indexing is required
2> the _leader_ stays up
3> the _follower_ goes into recovery for no readily apparent reason
4> the nail in the coffin for this particular issue is seeing that the follower
  went into LIR.
5> You'll also see a very large number of threads on the leader waiting
   on sending the updates to the follower.


If this is a problem, prior to 5.2 there are really only two solutions
1> throttle indexing
2> take all of the followers offline during indexing. When indexing is
  completed, bring the followers back up and let them replicate the
  full index down from the leader.
Other than shutting followers down, is there an elegant/graceful way of 
taking follower nodes offline? Also, to give you more of an idea: per the 
following document, I am testing the "Index heavy, Query heavy" situation.


https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks



Best,
Erick

On Fri, Oct 30, 2015 at 8:28 AM, Rallavagu  wrote:

4.10.4 solr cloud, 3 zk quorum, jdk 8

autocommit: 15 sec, softcommit: 2 min

Under heavy indexing load with above settings, i have seen tlog growing
(into GB). After the updates stopped coming in, it settles down and takes a
while to recover before cloud becomes "green".

With 15 second autocommit setting, what could potentially cause tlog to
grow? What to look for?


RE: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks Markus. I've been using field collapsing till now, but the performance
constraint is forcing me to think about index-time de-duplication. I've been
using a composite router to make sure that duplicate documents are routed to
the same shard. Won't that work for SignatureUpdateProcessorFactory?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-index-time-de-duplication-tp4237306p4237403.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks Scott. I could directly use field collapsing on the adskdedup field
without the signature field. The problem with field collapsing is the
performance overhead: it slows the query down up to 10-fold.
CollapsingQParserPlugin is a better option; unfortunately, it doesn't
support an ngroups equivalent, which is a requirement for me.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-index-time-de-duplication-tp4237306p4237401.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: growth of tlog

2015-10-30 Thread Erick Erickson
I infer that this statement: "takes a while to recover before cloud
becomes green"
indicates that the node is in recovery or something while indexing. If you're
still indexing, the new documents will be written to the followers
tlog while the
follower is recovering, leading to it growing. I expect that after followers
all recover, the tlog shrinks after a few commits have gone by.

If that's all true, the question is why the follower goes into
recovery in the first
place. Prior to 5.2, there was a situation in which very heavy indexing
could cause a follower to go into Leader Initiated Recovery (LIR) (look for this
in both the leader and follower logs). Here's the blog Tim Potter wrote
on this subject:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/

The smoking gun here is
1> heavy indexing is required
2> the _leader_ stays up
3> the _follower_ goes into recovery for no readily apparent reason
4> the nail in the coffin for this particular issue is seeing that the follower
 went into LIR.
5> You'll also see a very large number of threads on the leader waiting
  on sending the updates to the follower.


If this is a problem, prior to 5.2 there are really only two solutions
1> throttle indexing
2> take all of the followers offline during indexing. When indexing is
 completed, bring the followers back up and let them replicate the
 full index down from the leader.

Best,
Erick

On Fri, Oct 30, 2015 at 8:28 AM, Rallavagu  wrote:
> 4.10.4 solr cloud, 3 zk quorum, jdk 8
>
> autocommit: 15 sec, softcommit: 2 min
>
> Under heavy indexing load with the above settings, I have seen the tlog growing
> (into GB). After the updates stopped coming in, it settles down and takes a
> while to recover before cloud becomes "green".
>
> With 15 second autocommit setting, what could potentially cause tlog to
> grow? What to look for?
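
For reference, a minimal solrconfig.xml sketch of the commit settings described
in this thread (15 s hard commit, 2 min soft commit); hard commits with
openSearcher=false roll over the transaction log without the cost of opening a
new searcher:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- hard commit every 15 s: flushes segments and rolls the tlog -->
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- soft commit every 2 min: makes new documents visible to searches -->
    <maxTime>120000</maxTime>
  </autoSoftCommit>
</updateHandler>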


Re: growth of tlog

2015-10-30 Thread Shawn Heisey
On 10/30/2015 9:46 AM, Rallavagu wrote:
> Also, this affects available physical memory as tlog continues to grow
> and it is memory mapped.

I think this is a common misconception.

MMAP does *not* use up physical memory, at least not in the detrimental
way your sentence suggests.  Any memory (OS disk cache) used when
reading files this way can be immediately claimed by any program that
needs it.

https://en.wikipedia.org/wiki/Page_cache

MMAP allows programs to allocate LESS memory, not more.  It is far more
efficient if there is spare memory that is not explicitly needed by
applications, because of the OS disk cache.

Thanks,
Shawn



Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks for your reply. Have you customized SignatureUpdateProcessorFactory, or
are you using the configuration out of the box? I know it works for simple
dedup, but my requirement is a tad different, as I need to tag an identifier to
the latest document. My goal is to understand whether that's possible using
SignatureUpdateProcessorFactory.





Re: Solr 5.3.1 CREATE defaults to schema-less mode Java version 1.7.0_45

2015-10-30 Thread natasha
Hi Erick,

If I just run the following, I have no issue:

bin/solr start
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=test-core&instanceDir=/home/natasha/twc-session-dash1/collection1'
curl 'http://localhost:8983/solr/test-core/schema/fields'

As opposed to running bin/solr start -e cloud (which spins up an example)
before I load a core.

Thank you,
Natasha







On Fri, Oct 30, 2015 at 10:24 AM, Natasha Whitney 
wrote:

> Hi Erick,
>
> Thanks for your help. I am fairly new to Solr.
>
> I'm not set on using SolrCloud. No need for ZooKeeper or multiple leader
> nodes.
>
> What I have is an existing instanceDir (with a conf and data directory,
> with all requisite components) and I would like to create a new core based
> on this preexisting directory.
>
> To that effect, I am trying to use the SolrAdmin CREATE method. I have
> Solr 5.3.1 installed on my machine, and am trying to create a core where
> instanceDir is a path to the core (also on my machine). Even if I move
> instanceDir into the Solr root, I still see that the only fields returned
> in the schema are the default fields.
>
> If there is a preferable (non-cloud) way to create a core that points at an
> existing instanceDir, please advise!
>
> Thank you,
> Natasha
>
>
>
> On Thu, Oct 29, 2015 at 8:58 PM, Erick Erickson [via Lucene] <
> ml-node+s472066n4237322...@n3.nabble.com> wrote:
>
>> I'm pretty confused about what you're trying to do. You mention using
>> the SolrCloud UI to look at your core, but on the other hand you also
>> mention using the core admin to create the core.
>>
>> Trying to use the core admin commands with SolrCloud is a recipe for
>> disaster. Under the covers, the _collections_ api does, indeed, use
>> the core admin API to create cores, but it really must be precisely
>> done. If you're going to try to create your own cores, I recommend
>> setting up a non-SolrCloud system.
>>
>> If you want to use SolrCloud, then I _strongly_ recommend you use the
>> collections API to create your collections. You can certainly have a
>> single-shard collection that would be a leader-only collection (i.e.
>> no followers), which would have only a single core cluster-wide if
>> that fits your architecture
>>
>> As it is, in cloud mode Solr expects the configs to be up on
>> Zookeeper, not resident on disk somewhere. And the admin core create
>> command promises that you have the configs in
>> /Users/nw/Downloads/twc-session-dash/collection1 which is a recipe for
>> confusion on Solr's part...
>>
>> HTH,
>> Erick
>>
>> On Thu, Oct 29, 2015 at 4:24 PM, natasha <[hidden email]
>> > wrote:
>>
>> > Note, if I attempt to CREATE the core using Solr 5.3.0 on my openstack
>> > machine (Java version 1.7.0) I have no issues.
>> >
>> >
>> >
>> >
>> >
>>
>>
>
>




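For a purely non-cloud setup, Solr 5.x can also discover a core from a
core.properties file under SOLR_HOME; a sketch assuming the default server/solr
home and the paths mentioned in this thread:

mkdir -p server/solr/test-core
cp -r /home/natasha/twc-session-dash1/collection1/conf server/solr/test-core/
# a core.properties file marks the directory as a core;
# name defaults to the directory name if omitted
echo "name=test-core" > server/solr/test-core/core.properties
bin/solr restart
curl 'http://localhost:8983/solr/test-core/schema/fields'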

Re: Sort not working as expected

2015-10-30 Thread Brian Narsi
Is there no way that the existing field can be used?


On Fri, Oct 30, 2015 at 1:42 PM, Ray Niu  wrote:

> you should use string type instead of text if you want to sort
> alphabetically
>
> 2015-10-30 11:12 GMT-07:00 Brian Narsi :
>
> > I have a fieldtype setup as
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > When I sort on this field type in ascending order I am not getting
> results
> > sorted alphabetically as expected.
> >
> > Why is that?
> >
> > What should I do to get the sort on?
> >
> > Thanks
> >
>


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Mikhail Khludnev
Perhaps
FileBasedSpellChecker
https://cwiki.apache.org/confluence/display/solr/Spell+Checking

On Fri, Oct 30, 2015 at 9:37 PM, Robert Oschler 
wrote:

> Hello everyone,
>
> I have a gigantic list of industry terms that I want to import into a
> Solr/Lucene instance running on an AWS box.  What is the fastest way to
> import the list into my Solr/Lucene instance?  I have admin/sudo privileges
> on the box.
>
> Also, is there a document that shows me how to set up my Solr/Lucene config
> file to be optimized for fast searches on single word entries using fuzzy
> search?  I intend to use this Solr/Lucene instance to do spell checking on
> the big industry word list I mentioned above.  Each data record will be a
> single word from the file.  I'll want to take a single word query and do a
> fuzzy search on the word against the index (Levenshtein, max distance 2 as
> per Solr/Lucene's fuzzy search feature).  So what parameters will configure
> Solr/Lucene to be optimized for such a search?  Also, if a document shows
> the best index/read parameters to support single word fuzzy searching then
> that would be a big help too.  Note, the contents of the index will change
> very infrequently if that affects the optimal parameter mix.
>
>
> --
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
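
For reference, the file-based setup looks roughly like this in solrconfig.xml
(the dictionary file name is an assumption; one term per line, placed in the
core's conf/ directory):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>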





Solr Keyword query on a specific field.

2015-10-30 Thread Aaron Gibbons
Is there any way to have a single-field search use the same keyword search
logic as the default query? I define q.op as AND in my query, which gets
applied to the main keywords, but keywords I use within a fielded clause
do not get the same logic applied.
Example:
q=(title:(Test Keywords)) the space is treated as OR regardless of q.op
q=(Test Keywords) the space is defined by q.op which is AND

With explicit operators (AND, OR, *, -, +, ...) it works great as I have it
defined. There's just this one caveat when you use spaces between keywords
and expect the q.op operator to be applied.
Thanks,
Aaron
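
One hedged workaround with the stock lucene query parser: put the operator
inside the fielded clause explicitly, or set q.op and df for the whole query
via local params (v holds the raw keywords):

q=title:(Test AND Keywords)

q={!lucene q.op=AND df=title v='Test Keywords'}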


Re: Solr 5.3.1 CREATE defaults to schema-less mode Java version 1.7.0_45

2015-10-30 Thread natasha
Hi Erick,

Thanks for your help. I am fairly new to Solr.

I'm not set on using SolrCloud. No need for ZooKeeper or multiple leader
nodes.

What I have is an existing instanceDir (with a conf and data directory,
with all requisite components) and I would like to create a new core based
on this preexisting directory.

To that effect, I am trying to use the SolrAdmin CREATE method. I have Solr
5.3.1 installed on my machine, and am trying to create a core where
instanceDir is a path to the core (also on my machine). Even if I move
instanceDir into the Solr root, I still see that the only fields returned
in the schema are the default fields.

If there is a preferable (non-cloud) way to create a core that points at an
existing instanceDir, please advise!

Thank you,
Natasha



On Thu, Oct 29, 2015 at 8:58 PM, Erick Erickson [via Lucene] <
ml-node+s472066n4237322...@n3.nabble.com> wrote:

> I'm pretty confused about what you're trying to do. You mention using
> the SolrCloud UI to look at your core, but on the other hand you also
> mention using the core admin to create the core.
>
> Trying to use the core admin commands with SolrCloud is a recipe for
> disaster. Under the covers, the _collections_ api does, indeed, use
> the core admin API to create cores, but it really must be precisely
> done. If you're going to try to create your own cores, I recommend
> setting up a non-SolrCloud system.
>
> If you want to use SolrCloud, then I _strongly_ recommend you use the
> collections API to create your collections. You can certainly have a
> single-shard collection that would be a leader-only collection (i.e.
> no followers), which would have only a single core cluster-wide if
> that fits your architecture
>
> As it is, in cloud mode Solr expects the configs to be up on
> Zookeeper, not resident on disk somewhere. And the admin core create
> command promises that you have the configs in
> /Users/nw/Downloads/twc-session-dash/collection1 which is a recipe for
> confusion on Solr's part...
>
> HTH,
> Erick
>
> On Thu, Oct 29, 2015 at 4:24 PM, natasha <[hidden email]
> > wrote:
>
> > Note, if I attempt to CREATE the core using Solr 5.3.0 on my openstack
> > machine (Java version 1.7.0) I have no issues.
> >
> >
> >
> >
> >
>
>
>





Re: Sort not working as expected

2015-10-30 Thread Ray Niu
you should use string type instead of text if you want to sort
alphabetically

2015-10-30 11:12 GMT-07:00 Brian Narsi :

> I have a fieldtype setup as
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> When I sort on this field type in ascending order I am not getting results
> sorted alphabetically as expected.
>
> Why is that?
>
> What should I do to get the sort on?
>
> Thanks
>
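
A minimal schema.xml sketch of that suggestion (field and type names are
illustrative): keep the analyzed field for searching and add an untokenized
string copy purely for sorting, since sorting needs a single term per document:

<field name="title"      type="text_ngram" indexed="true" stored="true"/>
<field name="title_sort" type="string"     indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>

Then sort with sort=title_sort asc while still querying the analyzed field.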


Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Is there some reason that you don’t want to use aspell with a custom 
dictionary? Lucene and Solr are pretty weak compared to purpose-built spelling 
checkers.

http://aspell.net/ 

Also, consider the Peter Norvig spell corrector approach. With a fixed list, it 
is blazing fast. In only 21 lines of Python.

http://norvig.com/spell-correct.html 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 30, 2015, at 11:37 AM, Robert Oschler  wrote:
> 
> Hello everyone,
> 
> I have a gigantic list of industry terms that I want to import into a
> Solr/Lucene instance running on an AWS box.  What is the fastest way to
> import the list into my Solr/Lucene instance?  I have admin/sudo privileges
> on the box.
> 
> Also, is there a document that shows me how to set up my Solr/Lucene config
> file to be optimized for fast searches on single word entries using fuzzy
> search?  I intend to use this Solr/Lucene instance to do spell checking on
> the big industry word list I mentioned above.  Each data record will be a
> single word from the file.  I'll want to take a single word query and do a
> fuzzy search on the word against the index (Levenshtein, max distance 2 as
> per Solr/Lucene's fuzzy search feature).  So what parameters will configure
> Solr/Lucene to be optimized for such a search?  Also, if a document shows
> the best index/read parameters to support single word fuzzy searching then
> that would be a big help too.  Note, the contents of the index will change
> very infrequently if that affects the optimal parameter mix.
> 
> 
> -- 
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/
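
For what it's worth, a sketch of the custom-dictionary workflow in aspell (file
names are assumptions, and option spellings vary between aspell versions, so
check aspell help for yours):

# compile a master dictionary from a plain word list, one term per line
aspell --lang=en create master ./industry.rws < industry-terms.txt

# pipe mode: prints suggestions for each misspelled input word
echo "acetominophen" | aspell -a --lang=en --master=./industry.rws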



Re: Question on index time de-duplication

2015-10-30 Thread Scott Stults
At the top of the De-Duplication wiki page is a note about collapsing
results. Once you have the signature (identical for each of the duplicates)
you'll want to collapse your results, keeping the one with max date.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results


k/r,
Scott

On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo 
wrote:

> Yes, you can try using the SignatureUpdateProcessorFactory to hash
> the content into a signature field, and group on the signature field during
> your search.
>
> You can find more information here:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> I have been using this method to group indexed documents with duplicated
> content, and it is working fine.
>
> Regards,
> Edwin
>
>
> On 30 October 2015 at 07:20, Shamik Bandopadhyay 
> wrote:
>
> > Hi,
> >
> >   I'm looking at customizing index-time de-duplication. Here's my use
> case
> > and what I'm trying to achieve.
> >
> > I have identical documents coming from different release years of a given
> > product. I need to index them in Solr as they are required in individual
> > year contexts. But there's a generic search which spans all the years
> > and hence brings back duplicate/identical content. My goal is to return
> > only the latest document and filter out the rest. E.g., if product A has
> > identical documents for 2015, 2014, and 2013, search should only return
> > 2015 (the latest document) and filter out the rest.
> >
> > What I'm thinking (if possible) during index time :
> >
> > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > 2014 content, keeping 2015 (the latest release) untouched. During query
> > time, I'll add a filter which will exclude contents tagged with "dedup".
> >
> > Just wondering if this is achievable by perhaps extending
> > UpdateRequestProcessorFactory or
> > customizing SignatureUpdateProcessorFactory ?
> >
> > Any pointers will be appreciated.
> >
> > Regards,
> > Shamik
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com
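
To make the combination concrete, a hedged sketch with assumed field names
(signature, release_year). The update chain below hashes the duplicated content
into a signature field at index time:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <!-- fields assumed to define "identical" content -->
    <str name="fields">title,body</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

At query time, fq={!collapse field=signature max=release_year} then keeps only
the newest document in each duplicate group.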


Re: Performance degradation with two collection on same sole instance

2015-10-30 Thread SolrUser1543
We have 100 GB of RAM on each machine, 20 GB of it for the heap. The index
size of the big collection is 130 GB. The new second collection has only a few
documents, only a few MB.

When we disabled the new cores, performance improved.

Both collections use the same solrconfig, so they have the same filter
configurations.

But the second collection is very small, only a few documents, so its caches
cannot consume much memory.





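If the shared filter settings are a suspect, each collection can carry its own
cache sizing in its solrconfig.xml; an illustrative snippet (sizes are
assumptions, not recommendations):

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>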


Sort not working as expected

2015-10-30 Thread Brian Narsi
I have a fieldtype setup as

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


When I sort on this field type in ascending order I am not getting results
sorted alphabetically as expected.

Why is that?

What should I do to make the sort work?

Thanks


Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Robert Oschler
Hello everyone,

I have a gigantic list of industry terms that I want to import into a
Solr/Lucene instance running on an AWS box.  What is the fastest way to
import the list into my Solr/Lucene instance?  I have admin/sudo privileges
on the box.

Also, is there a document that shows me how to set up my Solr/Lucene config
file to be optimized for fast searches on single word entries using fuzzy
search?  I intend to use this Solr/Lucene instance to do spell checking on
the big industry word list I mentioned above.  Each data record will be a
single word from the file.  I'll want to take a single word query and do a
fuzzy search on the word against the index (Levenshtein, max distance 2 as
per Solr/Lucene's fuzzy search feature).  So what parameters will configure
Solr/Lucene to be optimized for such a search?  Also, if a document shows
the best index/read parameters to support single word fuzzy searching then
that would be a big help too.  Note, the contents of the index will change
very infrequently if that affects the optimal parameter mix.


-- 
Thanks,
Robert Oschler
Twitter -> http://twitter.com/roschler
http://www.RobotsRule.com/
http://www.Robodance.com/
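
For the fuzzy-search part of the question, a hedged query sketch (core and
field names are assumptions); Lucene's fuzzy syntax appends ~ and a maximum
edit distance to the term:

curl 'http://localhost:8983/solr/terms/select?q=term:acetominophen~2&wt=json'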


Re: Solr 5.3.1 CREATE defaults to schema-less mode Java version 1.7.0_45

2015-10-30 Thread Upayavira


On Fri, Oct 30, 2015, at 07:03 PM, natasha wrote:
> Hi Erick,
> 
> If I just run the following, I have no issue:
> 
> bin/solr start
> curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=test-core&instanceDir=/home/natasha/twc-session-dash1/collection1'
> curl 'http://localhost:8983/solr/test-core/schema/fields'
> 
> As opposed to running bin/solr start -e cloud (which spins up an example)
> before I load a core.

Well yes, the -e cloud option is starting it in SolrCloud mode, in which
case you should be using the collections API to create a collection. The
two scenarios above *aren't* the same.

Upayavira
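
For completeness, a hedged sketch of the SolrCloud path (the config name is an
assumption): upload the existing conf directory to ZooKeeper, then create the
collection through the Collections API. With bin/solr start -e cloud, the
embedded ZooKeeper usually listens on port 9983.

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
  -confdir /home/natasha/twc-session-dash1/collection1/conf -confname myconf

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=test-core&numShards=1&replicationFactor=1&collection.configName=myconf'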