How to handle special characters in fuzzy search query
So my Solr query is implemented in two parts: the first query does an exact search, and if there are no results found for the exact search, it goes to a second query that does a fuzzy search. Everything works fine, but consider a situation like this: a user enters "burg +". The exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throw an error. If I don't pass special characters it works fine. But in the real world a user can include such characters in their search, which will throw an error. Now I am stuck on this and don't know how to resolve the issue. This is how my exact search query looks:

$query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0);

This is how my fuzzy query looks:

$query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.') AND (business_active_flag:1) AND (business_point:[1.5 TO 2.0])&q.op=AND&wt=json&indent=true';

I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
How to get the docs id after commit
Hi, Solr developers. I want to get the newest committed docs in the postCommit event, and then notify the other server which data can be used, but I cannot find any way to get the newest docs after a commit. Is there any way to do this? Thank you. Wen Li
Re: Proximity searching in percentage
Hi Alessandro, I'm using Solr 5.0.0, but it is still able to work. Actually I found this to be better than query~1 or query~2, as it can automatically detect and allow the 20% error rate that I want. For query~1 or query~2, does it mean that I'll have to manually detect how many characters I entered before I assign the suitable ~ (tilde) param, in order to achieve the 20% error rate? I'll probably need an edit distance of 0 for words with 3 or fewer characters, 1 for words with 4 to 9 characters, an edit distance of 2 for words with 10 to 14 characters, and an edit distance of 3 for words with more than 15 characters. Yes, regarding the performance, I'm checking whether the length check will affect the query time. Thanks for your info on that. Currently my index is small, so everything seems to run quite fast and the delay is unnoticeable. But I'm not so sure whether it will slow down until it is noticeable by the user once I have tens of collections with millions of records. Regards, Edwin

On 8 May 2015 at 16:53, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Hi Zheng, actually that version of the fuzzy search is deprecated! Currently the fuzzy search syntax is: query~1 or query~2. The ~ (tilde) param is the number of edits we allow when generating all the expanded queries to run. Can I ask which version of Solr you are using? This article from 2011 shows the biggest change in fuzzy queries, and I guess it's still the current approach! Regarding the performance, what do you mean? Are you worried that the length check will affect the query time? The answer is yes, but the delay will be unnoticeable, as you simply check the length and apply the proper fuzzy param. Regarding the fact that a fuzzy query is slower than a normal query, that is true, but the FST approach guarantees really fast fuzzy queries. So if you do need the fuzziness, it's something you can cope with. Cheers

2015-05-08 3:12 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thank you for the information. I'm currently using the fuzzy search with the edit distance value set to ~0.79, and this has allowed a 20% error rate (i.e. for words with 5 characters it allows 1 mis-spelled character, and for words with 10 characters it allows 2 mis-spelled characters). However, for words with 4 characters I'll need to set the value to ~0.75 to allow 1 mis-spelled character, as accommodating a 4-character word requires a 25% error rate for 1 mis-spelled character. We probably will not accommodate 3-character words. I've gotten the information from here: http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Fuzzy%20Searches Just to check, will this affect the performance of the system? Regards, Edwin

On 7 May 2015 at 20:00, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Hi! Currently Solr builds an FST to provide proper fuzzy search or spellcheck suggestions based on the string distance. The current default algorithm is the Levenshtein distance (which returns the number of edits as the distance metric). In your case you should calculate, client side, the edits you want to apply to your search. In your client code it should not be difficult to process the query and apply the proper number of edits depending on the length. Anyway, the max edits for the default Levenshtein distance is fixed at 2. Cheers

2015-05-05 10:24 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Hi, I would like to check: how do we implement character proximity searching in terms of a percentage of the length of the word, instead of a fixed number of edits (characters)? For example, if we have a proximity of 20%, a word with 5 characters will have an edit distance of 1, and a word with 10 characters will automatically have an edit distance of 2. Will Solr be able to do that for us? Regards, Edwin

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Queries on SynonymFilterFactory
Hi, I would like to check: for the SynonymFilterFactory, I have the following in my synonyms.txt:

Titanium Dioxides, titanium oxide, pigment
pigment, colour, colouring material

If I set expand=false and I search for q=pigment, I will get results that match pigment, Titanium Dioxides and titanium oxide. But it will not match colour and colouring material, as all equivalent synonyms will only match the first in the list. If I set expand=true and I search for q=pigment, I'll get results that match everything in the list (i.e. Titanium Dioxides, titanium oxide, colour, colouring material). Is my understanding correct? Also, I would like to check: how come if I search q="pigment" (enclosed in quotes), I only get matches for Titanium Dioxides and not pigment? Regards, Edwin
Re: Proximity searching in percentage
Hi Zheng, actually that version of the fuzzy search is deprecated! Currently the fuzzy search syntax is: query~1 or query~2. The ~ (tilde) param is the number of edits we allow when generating all the expanded queries to run. Can I ask which version of Solr you are using? This article from 2011 shows the biggest change in fuzzy queries, and I guess it's still the current approach! Regarding the performance, what do you mean? Are you worried that the length check will affect the query time? The answer is yes, but the delay will be unnoticeable, as you simply check the length and apply the proper fuzzy param. Regarding the fact that a fuzzy query is slower than a normal query, that is true, but the FST approach guarantees really fast fuzzy queries. So if you do need the fuzziness, it's something you can cope with. Cheers

2015-05-08 3:12 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

On 7 May 2015 at 20:00, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Queries on SynonymFilterFactory
Just an update: the tokenizer class which I'm using is StandardTokenizerFactory, and I'm using Solr 5.0. On 8 May 2015 16:24, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: [...]
Re: Queries on SynonymFilterFactory
Let's explain this a little bit better. First of all, the SynonymFilter is a Token Filter, and being a Token Filter it can be part of an analysis pipeline at indexing and query time. As the different types of analysis explicitly state when the filtering happens, let's go into the details of synonyms.txt. This file contains a set of lines, each of them describing a synonym policy. There are 2 different syntaxes accepted:

couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny

- A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.
- Two comma-separated lists of words with the symbol => between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it is also in the list on the right.

Regarding the expand param, directly from the official Solr documentation: expand: (optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all equivalent synonyms will be reduced to the first in the list.

So, starting from this definition, let's answer your questions: 1) Regarding expand, the definition seems quite clear; if anything strange is occurring to you, let me know. 2) Regarding your second question, it depends on your synonyms.txt file: if you are not using the => syntax, you are always going to retrieve all the synonyms (including the original term). If you need more info let me know; it can also strictly depend on how you are using the filter (indexing? querying? both?). Example: if you are using the filter only at indexing time, then using the => syntax will prevent the user from searching for the original token in the synonyms.txt relation, because it will not appear in the index. Cheers

2015-05-08 9:24 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Proximity searching in percentage
2015-05-08 10:14 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Hi Alessandro, I'm using Solr 5.0.0, but it is still able to work. Actually I found this to be better than query~1 or query~2, as it can automatically detect and allow the 20% error rate that I want.

I don't think the double param is supported anymore, so we should take a look at the tricky formula underneath to understand how the exact edits are calculated.

For query~1 or query~2, does it mean that I'll have to manually detect how many characters I entered before I assign the suitable ~ (tilde) param, in order to achieve the 20% error rate?

Yes

I'll probably need an edit distance of 0 for words with 3 or fewer characters, 1 for words with 4 to 9 characters, an edit distance of 2 for words with 10 to 14 characters, and an edit distance of 3 for words with more than 15 characters.

This would be quite easy: just check the length and assign the proper edits according to your requirements.

Yes, regarding the performance, I'm checking whether the length check will affect the query time. Thanks for your info on that. Currently my index is small, so everything seems to run quite fast and the delay is unnoticeable. But I'm not so sure whether it will slow down until it is noticeable by the user once I have tens of collections with millions of records.

I think the length check will be constant time for any string (if you are using Java, and most likely constant in all other languages as well), so I would say it won't be a problem in comparison with the actual query time.

Regards, Edwin

On 8 May 2015 at 16:53, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
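To make the length check discussed in this thread concrete, here is a minimal client-side sketch in Java. The thresholds follow Edwin's 20% requirement and are assumptions, and since Lucene's default Levenshtein support caps max edits at 2, the 3-edit tier is clamped:

    // Map a term's length to the fuzzy edit distance appended as ~N.
    // Thresholds are illustrative, based on the 20% error rate discussed above;
    // FuzzyQuery supports at most 2 edits, so anything larger is clamped to 2.
    public final class FuzzyParams {

        public static int editsForLength(int len) {
            if (len <= 3) return 0;   // too short to allow a mis-spelling
            if (len <= 9) return 1;   // 4-9 characters -> 1 edit (roughly 20%)
            return 2;                 // 10+ characters -> capped at Lucene's max of 2
        }

        public static String fuzzify(String term) {
            int edits = editsForLength(term.length());
            return edits == 0 ? term : term + "~" + edits;
        }

        public static void main(String[] args) {
            System.out.println(fuzzify("cat"));          // cat
            System.out.println(fuzzify("pigment"));      // pigment~1
            System.out.println(fuzzify("colouration"));  // colouration~2
        }
    }

The method runs in constant time per term, which matches the point above that the length check is negligible next to the query itself.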
Re: Queries on SynonymFilterFactory
Thanks for explaining the information. Currently I'm only using the comma-separated list of words, and only using the synonym filter at query time. I find that when I set expand=true, quite a number of irrelevant results come back, and this didn't happen when I set expand=false. I've yet to try the lists of words with the symbol => between them. I'm trying to solve the multi-word synonyms too, and I found that enclosing the multi-word synonym in quotes will solve the issue. But this creates a problem: the original token is not returned if I enclose a single word in quotes. Will using the lists of words with the symbol => between them cater to multi-word synonyms better than the comma-separated list of words? Regards, Edwin

On 8 May 2015 at 17:10, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Solr 5.1.0 Cloud and Zookeeper
Hello Shacky, I have recently performed a manual installation of a ZooKeeper ensemble (3 ZooKeepers) on the same machine. I used the upstart init script from the official .deb configuration https://svn.apache.org/repos/asf/zookeeper/trunk/src/packages/deb/init.d/zookeeper and modified it to suit my custom installation. You can see the changes I made to the original zookeeper.sh in this gist https://gist.github.com/manios/e4fbd700e0d8999f5e17 along with the modifications to other files in the conf/ directory of the ZooKeeper distribution. Best regards, Christos

2015-05-05 16:35 GMT+03:00 shacky shack...@gmail.com: Thank you very much for your answer. I installed ZooKeeper 3.4.6 on my Debian (Wheezy) system, and it's working well. The only problem I have is that I'm looking for an init script but cannot find anything. I'm also trying to adapt the script in Debian's zookeeperd package, but I have some problems. Do you know of any working init scripts for ZooKeeper on Debian?

2015-05-05 15:30 GMT+02:00 Mark Miller markrmil...@gmail.com: A bug-fix version difference probably won't matter. It's best to use the same version everyone else uses and the one our tests use, but it's very likely 3.4.5 will work without a hitch. - Mark

On Tue, May 5, 2015 at 9:09 AM shacky shack...@gmail.com wrote: Hi. I read on https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble that Solr needs to use the same ZooKeeper version it ships with (at the moment 3.4.6). Debian Jessie has ZooKeeper 3.4.5 (https://packages.debian.org/jessie/zookeeper). Are you sure that this version won't work with Solr 5.1.0? Thank you very much for your help! Bye
Re: Queries on SynonymFilterFactory
So it means that having more than 10 or 20 synonym files locally will still be faster than accessing an external service? I found out that ZooKeeper only allows the synonyms.txt file to be a maximum of 1MB, and as my potential synonym file is more than 20MB, I'll need to split it into more than 20 files. Regards, Edwin
Re: Queries on SynonymFilterFactory
The documents seem to point to using the AutoPhrasingTokenFilter, putting an underscore in the multi-term synonym, or changing to index-time synonyms. I'm also thinking of putting the synonyms into a database, or querying some thesaurus website when the user enters the search key, instead of using the SynonymFilterFactory. With this approach, once the user enters a search key, the program will retrieve the list of synonyms. Then I'll append the list to the search parameters (i.e. q). I'll use relevancy boosting to give the original term a higher boost, and the synonyms a lower boost. Is this a good solution? Regards, Edwin

On 8 May 2015 17:40, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I found these very interesting articles that I think can help in better understanding the problem: http://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/ and http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/ Take a look and let me know!

2015-05-08 10:26 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

On 8 May 2015 at 17:10, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]
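As a rough illustration of the boosting approach Edwin describes above, here is a SolrJ sketch. The boost values are arbitrary, and the synonym lookup itself (database, thesaurus service, or a separate core) is assumed and not shown:

    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;

    public class SynonymBoostedQuery {

        // Build a query that favours the user's original term over its synonyms.
        // The synonyms list comes from whatever lookup source is used; multi-word
        // synonyms are quoted as phrases, matching the quoting trick mentioned above.
        // (Real code would also escape any quotes embedded in the terms.)
        static SolrQuery build(String userTerm, List<String> synonyms) {
            StringBuilder q = new StringBuilder();
            q.append('"').append(userTerm).append("\"^10");        // original term, higher boost
            for (String syn : synonyms) {
                q.append(" OR \"").append(syn).append("\"^2");     // synonyms, lower boost
            }
            return new SolrQuery(q.toString());
        }
    }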
Re: Proximity searching in percentage
Hi Alessandro, Thank you so much for the info. Will try that out. Regards, Edwin

On 8 May 2015 17:27, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]
Re: Queries on SynonymFilterFactory
I found these very interesting articles that I think can help in better understanding the problem: http://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/ and http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/ Take a look and let me know!

2015-05-08 10:26 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Queries on SynonymFilterFactory
Accessing an external service (such as a thesaurus website) for each query can slow down your system a lot. Having the synonyms locally, with the Solr integration, is much better. Cheers

2015-05-08 11:46 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]
Re: Solr Multilingual Indexing with one field- Guidance
Is it possible to know a little bit more about the nature of that multilingual field? I can see the KeywordTokenizer and then a lot of grams calculated from that token. What is that field used for?

2015-05-07 19:23 GMT+01:00 Kuntal Ganguly gangulykuntal1...@gmail.com: Our current production index size is 1.5 TB with 3 shards. Currently we have the following field type:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

The above field type is working well for the US and English-language clients. Now we have some new Chinese and Japanese clients, so after googling for the best approach to a multilingual index (http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/ and https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search), there seem to be pros/cons associated with every approach. Then I tried R&D with a single-field approach, and here's my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

I have kept the same tokenizer and only changed the filters. It is working well with all existing search use-cases for English documents as well as the new use-case for Chinese/Japanese documents. Now I have the following questions for the Solr experts/developers: 1) Is this a correct approach, or am I missing something? 2) Can you give me an example where there will be a problem with this new field type? A use-case/scenario with an example would be very helpful. 3) Is there any problem in the future with different clients coming up? Please provide some guidance.

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: New core on Solr Cloud
Thank you very much Erick. Bye

2015-05-06 17:06 GMT+02:00 Erick Erickson erickerick...@gmail.com: That should have put one replica on each machine; if it did, you're fine. Best, Erick

On Wed, May 6, 2015 at 3:58 AM, shacky shack...@gmail.com wrote: OK, I found out that the creation of a new core/collection on Solr 5.1 is done with the bin/solr script. So I created a new collection with this command: ./solr create_collection -c test -replicationFactor 3 Is this the correct way? Thank you very much, Bye!

2015-05-06 10:02 GMT+02:00 shacky shack...@gmail.com: Hi. This is my first experience with Solr Cloud. I installed three Solr nodes with three ZooKeeper instances and they seemed to start well. Now I have to create a new replicated core and I'm trying to find out how I can do it. I found many examples about how to create shards and cores, but I have to create one core with only one shard replicated on all three nodes (so basically I want to have the same data on all three nodes). Could you help me understand the correct way to do this, please? Thank you very much! Bye
Re: JSON Facet Analytics API in Solr 5.1
Hi Yonik, Any update on this question? Thanks in advance, Frank

On Thu, May 7, 2015 at 2:49 PM, Frank li fudon...@gmail.com wrote: Is there any book to read so I won't ask such dummy questions? Thanks.

On Thu, May 7, 2015 at 2:32 PM, Frank li fudon...@gmail.com wrote: This one does not have a problem, but how do I include sort in this facet query? Basically, I want to write a Solr query which can sort the facet count ascending. Something like http://localhost:8983/solr/demo/query?q=apple&json.facet={field=price sort='count asc'} I really appreciate your help. Frank

On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote: On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote: Hi Yonik, I am reading your blog. It is helpful. One question for you: for the following example, curl http://localhost:8983/solr/query -d 'q=*:*&rows=0&json.facet={ categories:{ type : terms, field : cat, sort : { x : desc}, facet:{ x : avg(price), y : sum(price) } } }' If I want to write it in the format of http://localhost:8983/solr/query?q=apple&json.facet={x:'avg(campaign_ult_defendant_cnt_is)'}, how do I do it? What problems do you encounter when you try that? If you try that URL with curl, be aware that curly braces {} are special globbing characters in curl. Turn them off with the -g option: curl -g "http://localhost:8983/solr/demo/query?q=apple&json.facet={x:'avg(price)'}" -Yonik
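For what it's worth, the same request can be built in SolrJ, which avoids curl's globbing entirely by passing json.facet as an ordinary request parameter. This is only a sketch; the demo collection and the cat field are borrowed from the examples above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class JsonFacetSortAsc {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/demo");
            SolrQuery q = new SolrQuery("apple");
            q.setRows(0);
            // Terms facet over cat, with buckets sorted by ascending count.
            q.set("json.facet", "{categories:{type:terms,field:cat,sort:\"count asc\"}}");
            QueryResponse rsp = client.query(q);
            System.out.println(rsp.getResponse().get("facets"));
            client.close();
        }
    }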
Re: determine big documents in the index?
One of my fields (the phrase suggestion field) has 30'860'099 terms. Is this too much? Another field (the single word suggestion) has 2'156'218 terms.

-----Original Message----- From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] Sent: Friday, 8 May 2015 15:54 To: solr-user@lucene.apache.org Subject: determine big documents in the index?

Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of space in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space. So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but binary blobs. In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)
SolrCloud 4.8.0 - Snapshots directory take a lot of space
Hi All, Looking at the data directory in my SolrCloud cluster I have found a lot of old snapshot directories, like these: snapshot.20150506003702765 snapshot.20150506003702760 snapshot.20150507002849492 snapshot.20150507002849473 snapshot.20150507002849459, or even a month older. These directories take up a lot of space, 2 or 3 times the whole index. May I delete these directories? If yes, is there a best practice? -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251
Re: Queries on SynonymFilterFactory
This is quite a big synonym corpus! If it's not feasible to have only one big synonym file (I haven't checked, so I assume the 1MB limit is true, even if strange), I would do an experiment: 1) test query time with a classic Solr config; 2) use an ad hoc Solr core to manage synonyms (in this way we can keep it updated and use it with a custom version of the SynonymFilter that gets the synonyms directly from another Solr instance); 2b) develop a Solr plugin to provide this approach. If the synonym thesaurus is really big, I guess managing it through another Solr core (or something similar) locally will be better than managing it with an external web service. Cheers

2015-05-08 12:16 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
determine big documents in the index?
Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of space in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space. So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but binary blobs. In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)
Re: How to handle special characters in fuzzy search query
Each of the characters you identified has meaning to the query parser: '+' marks a mandatory clause, '-' is a NOT operator, and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data; depending on your analysis chain, such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - see the escaping special characters bit. Best, Erick

On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: [...]
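To illustrate the escaping route Erick mentions, here is a minimal SolrJ sketch. The original poster is on PHP, which would need an equivalent escape function, and the field name is borrowed from the query earlier in the thread:

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeUserInput {
        public static void main(String[] args) {
            String userInput = "burg +";  // raw text typed by the user
            // Backslash-escapes +, -, *, (, ), quotes, whitespace and other
            // query-parser metacharacters so they are treated literally.
            String safe = ClientUtils.escapeQueryChars(userInput);
            System.out.println("business_name:" + safe);  // business_name:burg\ \+
        }
    }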
Re: determine big documents in the index?
Oops, this may be a better link: http://lucidworks.com/blog/indexing-with-solrj/

On Fri, May 8, 2015 at 9:55 AM, Erick Erickson erickerick...@gmail.com wrote: bq: has 30'860'099 terms. Is this too much? Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do normal text analysis, it's suspicious to say the least. There are about 300K words in the English language and you have 100X that. So either 1) you have a lot of legitimately unique terms, say part numbers, SKUs, etc., digits analyzed as text, whatever, or 2) you have a lot of garbage in your input. OCR is notorious for this, as are binary blobs. The TermsComponent is your friend; it'll allow you to get an idea of what the actual terms are, though it does take a bit of poking around. There's no good way I know of to tell which docs are taking up space in the index. What I'd probably do is use Tika in a SolrJ client and look at the data as I send it; here's a place to start: https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/ Best, Erick

On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV clemens...@mysign.ch wrote: [...]
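Along the lines of Erick's suggestion, here is a minimal Tika sketch for spotting PDFs whose extracted text is binary junk before they are indexed. The letters-and-digits heuristic and its 0.9 threshold are assumptions, not anything Solr or Tika provides:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfTextCheck {

        // Rough heuristic: real prose is overwhelmingly letters, digits and whitespace.
        static boolean looksLikeText(String s) {
            if (s.isEmpty()) return false;
            long ok = s.chars()
                       .filter(c -> Character.isLetterOrDigit(c) || Character.isWhitespace(c))
                       .count();
            return (double) ok / s.length() > 0.9;  // assumed threshold
        }

        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream(args[0])) {
                BodyContentHandler handler = new BodyContentHandler(-1);  // no write limit
                new AutoDetectParser().parse(in, handler, new Metadata());
                String text = handler.toString();
                // Documents failing the check could be skipped instead of indexed.
                System.out.println(args[0] + ": " + text.length() + " chars, "
                        + (looksLikeText(text) ? "looks like text" : "looks like binary junk"));
            }
        }
    }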
Re: Slow highlighting on Solr 5.0.0
I've been looking into this again. The phrase highlighter is much slower than the default highlighter, so you might be able to add hl.usePhraseHighlighter=false to your query to make it faster. Note that the web interface will NOT help here, because that param is true by default, and the checkbox is basically broken in that respect. Also, the default highlighter doesn't seem to work in all the cases the phrase highlighter does. The current development branch of 5x is much better than 5.1, but not as good as 4.10. This ticket seems to be hitting on some of the issues at hand: https://issues.apache.org/jira/browse/SOLR-5855 I think this means they are getting there, but the performance is really still much worse than 4.10, and it's not obvious why.

On 5/5/15, 2:06 AM, Ere Maijala ere.maij...@helsinki.fi wrote: I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here are my timings: 4.10.2: process: 1432.0, highlight: 723.0; 5.1.0: process: 9570.0, highlight: 8790.0. schema.xml and solrconfig.xml are available at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf A couple of jstack outputs taken while the query was executing are available at http://pastebin.com/eJrEy2Wb Any suggestions would be appreciated. Or would it make sense to just file a JIRA issue? --Ere

On 3.3.2015, 0.48, Matt Hilt wrote: Short form: while testing Solr 5.0.0 within our staging environment, I noticed that highlight-enabled queries are much slower than I saw with 4.10. Are there any obvious reasons why this might be the case? As far as I can tell, nothing has changed with the default highlight search component or its parameters. A little more detail: the bulk of the collection config set was taken from the basic 4.x example config set. I changed my schema.xml and solrconfig.xml just enough to get 5.0 to create a new collection (removed non-trie fields, some other deprecated response handler definitions, etc.). I can provide my version of the solr.HighlightComponent config, but it is identical to the sample_techproducts_configs example in 5.0. Are there any other config files I could provide that might be useful? Numbers on "much slower": I indexed a very small subset of my data into the new collection and used the /select interface to do a simple debug query. Solr 4.10 gives the following pertinent info: response: { numFound: 72628, ... debug: { timing: { time: 95, process: { time: 94, query: { time: 6 }, highlight: { time: 84 }, debug: { time: 4 } } whereas Solr 5.0 is: response: { numFound: 1093, ... debug: { timing: { time: 6551, process: { time: 6549, query: { time: 0 }, highlight: { time: 6524 }, debug: { time: 25 }
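For reference, the workaround from the first paragraph expressed as a SolrJ sketch; the highlighted field name is illustrative:

    import org.apache.solr.client.solrj.SolrQuery;

    public class HighlightWorkaround {
        static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.setHighlight(true);
            q.set("hl.fl", "content");                  // field(s) to highlight; illustrative name
            q.set("hl.usePhraseHighlighter", "false");  // skip the slow phrase highlighter
            return q;
        }
    }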
Re: Queries on SynonymFilterFactory
Thank you for your suggestions. I can't do proper testing on that yet as I'm currently using a normal PC with 4GB of RAM, and all this probably requires more RAM than what I have. I've tried running the setup with 20 synonym files, and the system went Out of Memory before I could test anything. For your option 2), do you mean that I'll need to download a synonym database (like the one over 20MB in size which I have), and index it into an ad hoc Solr core to manage it? I can probably only try these out properly when I get a server machine with more RAM. Regards, Edwin On 8 May 2015 at 22:16, Alessandro Benedetti benedetti.ale...@gmail.com wrote: This is a quite big synonym corpus! If it's not feasible to have only 1 big synonym file (I haven't checked, so I assume the 1MB limit is true, even if strange) I would do an experiment: 1) testing query time with a classic Solr config 2) use an ad hoc Solr core to manage synonyms (in this way we can keep it updated and use it with a custom version of the synonym filter that will get the synonyms directly from another Solr instance) 2b) develop a Solr plugin to provide this approach. If the synonym thesaurus is really big, I guess managing it through another Solr core (or something similar) locally will be better than managing it with an external web service. Cheers 2015-05-08 12:16 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: So it means that having more than 10 or 20 synonym files locally will still be faster than accessing an external service? As I found out that ZooKeeper only allows the synonym.txt file to be a maximum of 1MB, and as my potential synonym file is more than 20MB, I'll need to split the file into more than 20 of them. Regards, Edwin -- -- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
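One thing worth checking before building an ad hoc synonym core: the synonyms attribute of SynonymFilterFactory is documented to accept a comma-separated list of files, so a 20MB thesaurus can be split into chunks that each stay under ZooKeeper's 1MB znode limit. A sketch, with hypothetical file names:

  <filter class="solr.SynonymFilterFactory"
          synonyms="syn-part01.txt,syn-part02.txt,syn-part03.txt"
          ignoreCase="true" expand="true"/>

Whether 20+ files of that total size fit in 4GB of RAM is a separate question, since the whole synonym map is held in memory.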
Re: How to get the docs id after commit
Not that I know of. The "newest doc id" is pretty ambiguous. If I transmit a batch of 100 docs then commit, they're all committed at once. Which one, then, is newest? And consider what happens, in SolrCloud mode, if I send updates to two separate nodes. The docs are forwarded to the leader for the shard they belong on, and arrival order is not guaranteed, so which one is newest? Your best bet I think is to include, say, a timestamp or some such that represents what you consider newest, then just do a *:* query and sort by your marker descending. The first doc returned will be what you defined as newest. Best, Erick On Fri, May 8, 2015 at 12:56 AM, liwen(李文).apabi l@founder.com.cn wrote: Hi, Solr Developers I want to get the newest committed docs in the postCommit event, then notify the other server which data can be used, but I can not find any way to get the newest docs after a commit, so is there any way to do this? Thank you. Wen Li
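A minimal sketch of the timestamp approach Erick describes (the field name is a placeholder): add an auto-populated date field to the schema,

  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>

then, after a commit, ask for the most recently indexed document:

  http://localhost:8983/solr/collection1/select?q=*:*&sort=timestamp+desc&rows=1

As Erick notes, "newest" here is whatever your marker says it is, not an ordering Solr guarantees.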
Best way to backup and restore an index for a cloud setup in 4.6.1?
All, With a cloud setup for a collection in 4.6.1, what is the most elegant way to back up and restore an index? We are specifically looking into the use case of doing a full reindex: building an index on one set of servers, backing up the index, and then restoring that backup on another set of servers. Is there a better way to rebuild indexes on another set of servers? We are not sharding, if that makes any difference. Thanks, g10vstmoney
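One option in 4.x is the replication handler's snapshot command, which also works against SolrCloud cores; a sketch, with host, core name and location as placeholders:

  http://localhost:8983/solr/collection1/replication?command=backup&location=/backups

This writes a snapshot.<timestamp> directory under the given location. 4.6.1 has no restore API, so restoring means stopping the target Solr, copying the snapshot contents into the core's data/index directory, and restarting. Whether that beats simply reindexing onto the second set of servers depends on index size and indexing cost.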
Re: determine big documents in the index?
bq: has 30'860'099 terms. Is this too much Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do normal text analysis, it's suspicious to say the least. There are about 300K words in the English language and you have 100X that. So either (1) you have a lot of legitimately unique terms, say part numbers, SKUs, etc., digits analyzed as text, whatever, or (2) you have a lot of garbage in your input. OCR is notorious for this, as are binary blobs. The TermsComponent is your friend; it'll allow you to get an idea of what the actual terms are, though it does take a bit of poking around. There's no good way I know of to tell which docs are taking up space in the index. What I'd probably do is use Tika in a SolrJ client and look at the data as I sent it; here's a place to start: https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/ Best, Erick On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV clemens...@mysign.ch wrote: One of my fields (the phrase suggestion field) has 30'860'099 terms. Is this too much? Another field (the single word suggestion) has 2'156'218 terms. -----Original Message----- From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] Sent: Friday, 8 May 2015 15:54 To: solr-user@lucene.apache.org Subject: determine big documents in the index? Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of space in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space. So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but binary blobs. In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)
Re: How to handle special characters in fuzzy search query
Steven: They're listed in the ref guide I posted. Not a concise list, but you'll see || and other interesting bits. On Fri, May 8, 2015 at 9:20 AM, Steven White swhite4...@gmail.com wrote: Hi Erick, Is there a documented list of all operators (AND, OR, NOT, etc.) that also need to be escaped? Are there more besides the 3 I listed? Thanks Steve On Fri, May 8, 2015 at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote: Each of the characters you identified are characters that have meaning to the query parser: '+' is a mandatory clause, '-' is a NOT operator and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data. Depending on your analysis chain such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - the escaping special characters bit. Best, Erick On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: So my Solr query is implemented in two parts: the first query does an exact search; if there are no results found for the exact search, it goes to the second query, which does a fuzzy search. Everything works fine, but in situations like a user entering "burg +", the exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throws an error. If I don't pass special characters it works fine. But in the real world a user can put such characters in their search, which will throw an error. Now I am stuck and don't know how to resolve this issue. This is how my exact search query looks: $query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0); This is how my fuzzy query looks: $query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.')AND(business_active_flag:1)AND(business_point:[1.5 TO 2.0])q.op=ANDwt=jsonindent=true'; I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
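For anyone doing this from SolrJ rather than PHP, there is a ready-made escaping helper; a minimal sketch (userInput is a placeholder for whatever the user typed):

  import org.apache.solr.client.solrj.util.ClientUtils;

  // escapeQueryChars backslash-escapes every character the query
  // parsers treat specially (including whitespace), so input like
  // "burg +" becomes "burg\ \+" and parses cleanly
  String safe = ClientUtils.escapeQueryChars(userInput);

  // it escapes ~ and * too, so append fuzzy/wildcard operators
  // only after escaping the bare term:
  String fuzzy = ClientUtils.escapeQueryChars("burg") + "~2";

The PHP client has no equivalent built in, but the same character set can be backslash-escaped with a regex before the query string is assembled.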
Re: How to handle special characters in fuzzy search query
Hi Erick, Is there a documented list of all operators (AND, OR, NOT, etc.) that also need to be escaped? Are there more besides the 3 I listed? Thanks Steve On Fri, May 8, 2015 at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote: Each of the characters you identified are characters that have meaning to the query parser: '+' is a mandatory clause, '-' is a NOT operator and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data. Depending on your analysis chain such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - the escaping special characters bit. Best, Erick On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: So my Solr query is implemented in two parts: the first query does an exact search; if there are no results found for the exact search, it goes to the second query, which does a fuzzy search. Everything works fine, but in situations like a user entering "burg +", the exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throws an error. If I don't pass special characters it works fine. But in the real world a user can put such characters in their search, which will throw an error. Now I am stuck and don't know how to resolve this issue. This is how my exact search query looks: $query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0); This is how my fuzzy query looks: $query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.')AND(business_active_flag:1)AND(business_point:[1.5 TO 2.0])q.op=ANDwt=jsonindent=true'; I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
Re: Limit the documents for each shard in solr cloud
Hi, Actually we are facing a lot of issues with Solr shards in our environment. Our environment is fully loaded with around 150 million documents, where each document has around 50+ stored fields, many of them multivalued. We also have a lot of custom components in this environment which use FieldCache and various other Solr features. The main issue we are facing is shards going down frequently in SolrCloud. As you mentioned in this reply (and as I have observed in various other replies on memory issues), I will try to debug further and keep this thread posted on any issues I find in that process. Thanks, Jilani On Thu, May 7, 2015 at 10:17 PM, Daniel Collins danwcoll...@gmail.com wrote: Jilani, you did say "My team needs that option"; if at all possible, my first response would be "why?". Why do they want to limit the number of documents per shard? What's the rationale/use case behind that requirement? Once we understand that, we can explain why it's a bad idea. :) I suspect I'm reiterating Jack's comments, but why are you sharding in the first place? 8 shards split across 4 machines, so 2 shards per machine. But you have 2 replicas of each shard, so you have 16 Solr cores, and hence 4 Solr cores per machine. Since you need an instance of all 8 shards to be up in order to service requests, you can get away with everything on 2 machines, but you still have 8 Solr cores to manage in order to have a fully functioning system. What's the benefit of sharding in this scenario? Sharding adds complexity, so you normally only add sharding if your search times are too slow without it. You need to work out how much disk space the whole 20m docs is going to take (maybe index 1m or 5m docs and extrapolate if they are all equivalent in size), then split it across 4 machines. But as Erick points out you need to allow for merges to occur, so whatever the space of the static data set, you need to allow for double that from time to time if background merges are happening. On 7 May 2015 at 16:05, Jack Krupansky jack.krupan...@gmail.com wrote: A leader is also a replica - SolrCloud is not a master/slave architecture. Any replica can be elected to be the leader, but that is only temporary and can change over time. You can place multiple shards on a single node, but was that really your intention? Generally, number of nodes equals number of shards times the replication factor. But then divided by shards per node if you do place more than one shard per node. -- Jack Krupansky On Thu, May 7, 2015 at 1:29 AM, Jilani Shaik jilani24...@gmail.com wrote: Hi, Is it possible to restrict the number of documents per shard in SolrCloud? Let's say we have a SolrCloud with 4 nodes, and on each node we have one leader and one replica. Likewise, in total we have 8 shards, including replicas. Now I need to index my documents in such a way that each shard will have only 5 million documents. Total documents in the SolrCloud should be 20 million documents. Thanks, Jilani
Re: Not able to Add docValues in Solr
Never mind.. used the zkcli.sh that comes with Solr to accomplish the firewall
indexing java byte code in classes / jars
I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: How to handle special characters in fuzzy search query
FWIW you may also want to drop the boolean ops in favour of + and - (OR being the default). regards, LAFK 2015-05-08 18:59 GMT+02:00 Erick Erickson erickerick...@gmail.com: Steven: They're listed in the ref guide I posted. Not a concise list, but you'll see || and other interesting bits. On Fri, May 8, 2015 at 9:20 AM, Steven White swhite4...@gmail.com wrote: Hi Erick, Is there a documented list of all operators (AND, OR, NOT, etc.) that also need to be escaped? Are there more besides the 3 I listed? Thanks Steve On Fri, May 8, 2015 at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote: Each of the characters you identified are characters that have meaning to the query parser: '+' is a mandatory clause, '-' is a NOT operator and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data. Depending on your analysis chain such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - the escaping special characters bit. Best, Erick On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: So my Solr query is implemented in two parts: the first query does an exact search; if there are no results found for the exact search, it goes to the second query, which does a fuzzy search. Everything works fine, but in situations like a user entering "burg +", the exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throws an error. If I don't pass special characters it works fine. But in the real world a user can put such characters in their search, which will throw an error. Now I am stuck and don't know how to resolve this issue. This is how my exact search query looks: $query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0); This is how my fuzzy query looks: $query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.')AND(business_active_flag:1)AND(business_point:[1.5 TO 2.0])q.op=ANDwt=jsonindent=true'; I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
Re: and stopword in user query is being change to q.op=AND
Thanks Shawn and Hoss. Just added lowercaseOperators=false to my edismax config and everything seems to be working. *Thanks,* *Rajesh,* *(mobile) : 8328789519.* On Mon, Apr 27, 2015 at 11:53 AM, Rajesh Hazari rajeshhaz...@gmail.com wrote: I did go through the documentation of edismax (Solr 5.1 documentation), which suggests using the *stopwords* query param to signal the parser to respect the StopFilterFactory while parsing; still, I did not find this happening. My final query looks like this http://host/solr/collection/select?q=term1+and+term2&sort=update_time+desc&rows=1&wt=json&indent=true&debugQuery=true&defType=edismax&stopwords=true&group=true&group.ngroups=true&group.field=title&stopwords=true debug:{ rawquerystring:term1 and term2, querystring:term1 and term2, parsedquery:(+(+DisjunctionMaxQuery((textSpell:term1)) +DisjunctionMaxQuery((textSpell:term2/no_coord, parsedquery_toString:+(+(textSpell:term1) +(textSpell:term2)), explain:{}, QParser:ExtendedDismaxQParser,... .. Was this param introduced in, and supported from, a specific version of Solr? Our Solr versions are 4.7 and 4.9. *Thanks,* *Rajesh**.* On Sun, Apr 26, 2015 at 9:22 PM, Rajesh Hazari rajeshhaz...@gmail.com wrote: Thank you Hoss for correcting my understanding; again I missed this concept of edismax. Do we have any SolrJ class or helper to handle the scenario of passing the query terms (with stopwords stripped) to edismax using the SolrJ API? For example, if a user queries for *term1 and term2*, build a query to pass on to edismax so that this user query will be parsed as *parsedquery: (+(DisjunctionMaxQuery((textSpell:term1) DisjunctionMaxQuery((textSpell:term2/no_coord* *Thanks,* *Rajesh**.* On Fri, Apr 24, 2015 at 1:13 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I was under the understanding that stopwords are filtered even before being : parsed by the search handler; I do have the filter in the collection schema to : filter stopwords, and the analysis shows that this stopword is filtered Generally speaking, your understanding of the order of operations for query parsing (regardless of the parser) and analysis (regardless of the fields/analyzers/filters/etc...) is backwards. The query parser gets, as its input, the query string (as a *single* string) and the request params. It inspects/parses the string according to its rules/options/syntax and, based on what it finds in that string (and in other request params), it passes some/all of that string to the analyzer for one or more fields, and uses the results of those analyzers as the terms for building up a query structure. Ask yourself: if the raw user query input was first passed to an analyzer (for stopword filtering as you suggest) before being passed to the query parser -- how would Solr know what analyzer to use? In many parsers (like lucene and edismax) the fields to use can be specified *inside* the query string itself. Likewise: how would you ensure that syntactically significant string sequences (like ( and : and AND etc.) that an analyzer might normally strip out based on the tokenizer/tokenfilters would be preserved so that the query parser could have them and use them to drive the resulting query structure? -Hoss http://www.lucidworks.com/
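For reference, a sketch of where that setting can live in solrconfig.xml (the handler name and other defaults are placeholders):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- don't treat lowercase "and"/"or" in user queries as boolean operators -->
      <str name="lowercaseOperators">false</str>
    </lst>
  </requestHandler>

The same param can also be passed per request as lowercaseOperators=false.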
Re: indexing java byte code in classes / jars
What do the various Java IDEs use for indexing classes for field/type/variable/method usage search? I imagine it's got to be bytecode. On Fri, May 8, 2015 at 2:40 PM, Tomasz Borek tomasz.bo...@gmail.com wrote: Out of curiosity: why bytecode? regards, LAFK 2015-05-08 21:31 GMT+02:00 Mark javam...@gmail.com: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
Out of curiosity: why bytecode? regards, LAFK 2015-05-08 21:31 GMT+02:00 Mark javam...@gmail.com: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: Solr Exception The remote server returned an error: (400) Bad Request.
Short answer: wget skips the body on a 400, assuming you didn't want the error page stored. Long answer: get your error page with additional wget params, like so: ✗ wget -Sd http://10.0.3.113:8080/solr/collection1/vitas\?q\=coreD%3A25 DEBUG output created by Wget 1.15 on linux-gnu. URI encoding = `UTF-8' --2015-05-08 21:56:55-- http://10.0.3.113:8080/solr/collection1/vitas?q=coreD%3A25 Łączenie się z 10.0.3.113:8080... połączono. Created socket 3. Releasing 0x00aa35d0 (new refcount 0). Deleting unused 0x00aa35d0. ---request begin--- GET /solr/collection1/vitas?q=coreD%3A25 HTTP/1.1 User-Agent: Wget/1.15 (linux-gnu) Accept: */* Host: 10.0.3.113:8080 Connection: Keep-Alive ---request end--- Żądanie HTTP wysłano, oczekiwanie na odpowiedź... ---response begin--- HTTP/1.1 400 Bad Request Server: Apache-Coyote/1.1 Cache-Control: no-cache, no-store Pragma: no-cache Expires: Sat, 01 Jan 2000 01:00:00 GMT Last-Modified: Fri, 08 May 2015 19:56:55 GMT ETag: 14d351a25a9 Content-Type: application/json;charset=UTF-8 Transfer-Encoding: chunked Date: Fri, 08 May 2015 19:56:55 GMT Connection: close ---response end--- HTTP/1.1 400 Bad Request Server: Apache-Coyote/1.1 Cache-Control: no-cache, no-store Pragma: no-cache Expires: Sat, 01 Jan 2000 01:00:00 GMT Last-Modified: Fri, 08 May 2015 19:56:55 GMT ETag: 14d351a25a9 Content-Type: application/json;charset=UTF-8 Transfer-Encoding: chunked Date: Fri, 08 May 2015 19:56:55 GMT Connection: close Registered socket 3 for persistent reuse. URI content encoding = `UTF-8' Skipping 95 bytes of body: [{responseHeader:{status:400,QTime:2},error:{msg:undefined field coreD,code:400}} ] done. 2015-05-08 21:56:55 BŁĄD 400: Bad Request. regards, LAFK 2015-05-05 17:44 GMT+02:00 marotosg marot...@gmail.com: Thanks for the answer, but I don't think that's going to solve my problem. For instance, if I copy this query into the Chrome browser http://localhost:8080/solr48/person/select?q=CoreD:25 I get this error: status=400, QTime=1, q=CoreD:25, msg="undefined field CoreD", code=400. If I use wget from Linux, wget http://localhost:8080/solr48/person/select?q=CoreD:25 I get ERROR: 400 Bad Request. Is there any reason why I am not getting the same error? Thanks
Re: indexing java byte code in classes / jars
To answer "why bytecode": mostly because the use case I have is to index as much detail as possible from jars/classes: extract class names, method names/signatures, packages/imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method/signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that takes a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
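In case it helps anyone going the same route, a minimal ASM sketch of pulling class and method names out of a single .class file (assumes the asm jar on the classpath; untested against any particular ASM version):

  import org.objectweb.asm.ClassReader;
  import org.objectweb.asm.ClassVisitor;
  import org.objectweb.asm.MethodVisitor;
  import org.objectweb.asm.Opcodes;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class ClassIndexer {
      public static void main(String[] args) throws Exception {
          // read one .class file given on the command line
          byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
          new ClassReader(bytes).accept(new ClassVisitor(Opcodes.ASM5) {
              @Override
              public void visit(int version, int access, String name,
                      String sig, String superName, String[] ifaces) {
                  // candidate "class name" field for a Solr document
                  System.out.println("class: " + name.replace('/', '.'));
              }
              @Override
              public MethodVisitor visitMethod(int access, String name,
                      String desc, String sig, String[] exceptions) {
                  // method name plus raw JVM descriptor,
                  // e.g. parse (Ljava/lang/String;)V
                  System.out.println("method: " + name + " " + desc);
                  return null; // don't descend into bytecode instructions
              }
          }, ClassReader.SKIP_CODE);
      }
  }

Each class would then become one Solr document with multivalued method-name and signature fields; the raw descriptors can be turned into readable type names with ASM's Type.getArgumentTypes(desc).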
Re: Fuzzy phrases + weighting at query level or do I need to program?
Best I found so far is: +place:(+word1~ +word2~ +word3~) regards, LAFK 2015-04-26 3:20 GMT+02:00 Tomasz Borek tomasz.bo...@gmail.com: Ave! How do I make fuzzy search work on lengthy names? As in La Riviera Montana de los Diablos or Unified Mega Corp Super Dwelling? Across all queries? My query has 3 levels of results: Best results are: +title:X +place:Y - Q1 If none such are found, +title:x - Q2 then +place:Y - Q3 All in all: (Q1) (Q2) (Q3) Yonik's examples have fuzzy search by ~ on one term; on more than one it becomes proximity search. regards, LAFK
Re: indexing java byte code in classes / jars
Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
Erik, Thanks for the pretty much OOTB approach. I think I'm going to just try a range of approaches and see how far I get. The "IDEs do this" suggestion would be worth looking into as well. On 8 May 2015 at 22:14, Mark javam...@gmail.com wrote: https://searchcode.com/ looks really interesting, however I want to crunch as many searchable aspects as possible out of jars sitting on a classpath or under a project structure... Really early days so I'm open to any suggestions On 8 May 2015 at 22:09, Mark javam...@gmail.com wrote: To answer "why bytecode": mostly because the use case I have is to index as much detail as possible from jars/classes: extract class names, method names/signatures, packages/imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method/signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that takes a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
— Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
RE: indexing java byte code in classes / jars
There are a number of reverse compilers for Java. Some are quite good and very detailed, so long as the byte code has not been deliberately obfuscated. Of course the original sources would be better for picking up comments. But then you'd need a Java parser (the compiler front end), of which there are a few available as well. Hmm, this looks interesting ... https://searchcode.com/ -----Original Message----- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Friday, May 08, 2015 4:19 PM To: solr-user@lucene.apache.org Subject: Re: indexing java byte code in classes / jars Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
https://searchcode.com/ looks really interesting, however I want to crunch as many searchable aspects as possible out of jars sitting on a classpath or under a project structure... Really early days so I'm open to any suggestions On 8 May 2015 at 22:09, Mark javam...@gmail.com wrote: To answer "why bytecode": mostly because the use case I have is to index as much detail as possible from jars/classes: extract class names, method names/signatures, packages/imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method/signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that takes a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
— Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: ZooKeeperException: Could not find configName for collection
Thank you Erick for your answer! I just tried to restart the first node and now the error is no longer there! Sorry for my too-early email :-) Bye! 2015-05-06 17:05 GMT+02:00 Erick Erickson erickerick...@gmail.com: Have you looked around at your directories on disk? I'm _not_ talking about the admin UI here. The default is core discovery mode, which recursively looks under solr_home and thinks there's a core wherever it finds a core.properties file. If you find such a thing, rename it or remove the directory. Another alternative would be to push a configset named new_core up to ZooKeeper; that might allow you to see (and then delete) the collection new_core belongs to. It looks like you tried to use the admin UI to create a core and it's all local, or something like that. Best, Erick On Wed, May 6, 2015 at 4:00 AM, shacky shack...@gmail.com wrote: Hi list. I created a new collection on my new SolrCloud installation. The new collection is shown and replicated on all three nodes, but on the first node (only on this one) I get this error: new_core: org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection new_core found:null I cannot see any core named new_core on that node, and I also tried to remove it: root@index1:/opt/solr# ./bin/solr delete -c new_core Connecting to ZooKeeper at zk1,zk2,zk3 ERROR: Collection new_core not found! Could you help me, please? Thank you very much! Bye
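For the record, pushing a configset up and linking it to a collection can be done with the zkcli script that ships with Solr (under server/scripts/cloud-scripts/ in Solr 5; paths and the ZooKeeper connect string below are placeholders):

  ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd upconfig -confdir /path/to/conf -confname new_core
  ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd linkconfig -collection new_core -confname new_core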
Re: solr.war built from solr 4.7.2 not working
On 5/7/2015 11:52 PM, Rahul Singh wrote: ERROR - 2015-05-08 11:15:25.738; org.apache.solr.common.SolrException; null:java.lang.IllegalArgumentException: You cannot set an index-time boost on an unindexed field, or one that omits norms This seems to be the problem. You are trying to set an index-time boost on a field whose definition disables indexing or omits norms. That isn't allowed, because if the field isn't indexed, you can't search on it (and therefore can't boost), and the index-time boost is stored in the field norm. Thanks, Shawn
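In schema.xml terms, an index-time boost only works on a field like the first definition below; sending a boost for a field like the second triggers exactly this exception (field names are made up for illustration):

  <!-- index-time boost OK: indexed, norms kept -->
  <field name="title" type="text_general" indexed="true" stored="true" omitNorms="false"/>

  <!-- index-time boost fails: not indexed (likewise with omitNorms="true") -->
  <field name="title_display" type="text_general" indexed="false" stored="true"/>

Either drop the boost from the update or fix the field definition; note that omitNorms defaults to true for primitive field types such as strings, numerics and dates.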
Re: SolrCloud indexing
I have just added a comment to the CWiki. Thanks again for your prompt answer Erick. Best, Vincenzo On Fri, May 8, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: bq: ...forwards the index notation to itself and any replicas... That's just odd phrasing. All that means is that the document is sent through the indexing process on the leader and all followers for a shard and is indexed independently on each. This is as opposed to the old master/slave situation where the master indexed the doc, but the slave got the indexed version as part of a segment when it replicated. Could you add a comment to the CWiki calling the phrasing out? It really is a bit mysterious. Best, Erick On Thu, May 7, 2015 at 2:18 PM, Vincenzo D'Amore v.dam...@gmail.com wrote: Thanks Shawn. Just to make the picture clearer, I'm trying to understand why a 3-node SolrCloud cluster and an old-style Solr server take the same time to index the same documents. But in the wiki it is written: If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and *forwards the index notation to itself and any replicas*. https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Could you please explain what "forwards the index notation" means? On the other hand, on SolrCloud I have 3 shards and 2 replicas for each shard. So every node is indexing all the documents, and this explains why SolrCloud takes the same time as an old-style Solr server. On Thu, May 7, 2015 at 3:08 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/7/2015 3:04 AM, Vincenzo D'Amore wrote: Thanks Erick. I'm not sure I got your answer. I'll try to recap: when the raw document has to be indexed, it will be forwarded to the shard leader. The shard leader indexes the document for that shard, and then forwards the indexed document to any replicas. I just want to be sure that when the raw document is forwarded from the leader to the replicas it will be indexed only one time, on the shard leader. From what I understand replicas do not index; only the leader indexes. The document is indexed by all replicas. There is no way to forward the indexed document, it can only forward the source document ... so each replica must index it independently. The old-style master-slave replication (which existed long before SolrCloud) copies the finished Lucene segments, so only the master actually does indexing. SolrCloud doesn't have a master, only multiple replicas, one of which is elected leader, and replication only comes into the picture if there's a serious problem and Solr determines that it can't use the transaction log to recover the index. Thanks, Shawn -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251 -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251