ngrams with position

2016-03-07 Thread elisabeth benoit
Hello,

I'm using Solr 4.10.1. I'd like to index words as n-grams of fixed length,
with the position appended at the end.

For instance, with fixed length 3, Amsterdam would become something like:


a0 (two spaces added at the beginning)
am1
ams2
mst3
ste4
ter5
erd6
rda7
dam8
am9 (one trailing space added at the end)

The number at the end is the position.
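For reference, a minimal Python sketch of the desired token stream (producing this inside Solr would likely need a custom TokenFilter; the stock NGram filters in 4.10 do fixed-length grams but do not pad the input or append positions to the token text):

```python
def positional_ngrams(word, n=3):
    """Pad with n-1 leading spaces and one trailing space (matching the
    example above), then emit each n-gram with its position appended."""
    padded = " " * (n - 1) + word + " "
    return [padded[i:i + n] + str(i) for i in range(len(padded) - n + 1)]

print(positional_ngrams("amsterdam"))
```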

Does anyone have a clue how to achieve this?

Best regards,
Elisabeth


Currency range Filter taking long time

2016-03-07 Thread stephanustedy
Hi.
I have two issues:

1. Currency
My field is price_c, which can be in USD or a local currency.
I'm filtering with price_c:[0 TO 500],
but sometimes the query takes more than 2 seconds.

2. Warm-up
I have master-slave replication with 1 master and 2 slaves of the same spec.
Every time I commit and replicate,
the server load goes up.
I use autoSoftCommit every 300 seconds and a hard commit every 600 seconds.

What should I do to fix these?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Currency-range-Filter-taking-long-time-tp4262371.html
Sent from the Solr - User mailing list archive at Nabble.com.


UIMA processing issues with atomic updates

2016-03-07 Thread srinivasarao vundavalli
Hi,

I have Solr 5.5.0 configured with UIMA and Tika. I am facing issues when I
am doing atomic updates for the documents already indexed.



  

(The solrconfig.xml excerpts were stripped of their XML markup by the list
archive. The surviving fragments indicate a UIMA update request processor
chain that analyzes the 'content' and 'title' fields and maps
org.apache.uima.TokenAnnotation features (coveredText to posVals, posTag to
posTags), an update handler wired to the 'uima' processor chain, and a
Tika/extracting-handler configuration referencing last_modified, an
'ignored_' field prefix, and a date format pattern.)

My schema has the fields 'title', 'content' which are used by UIMA and
copied to 'text' using copyField. (The field definitions themselves were
stripped of their XML markup by the archive.)

I tried removing stored="true" from the 'text' field, but no luck.

This link https://issues.apache.org/jira/browse/SOLR-8528 says it's fixed,
but I am still facing the issue.
Can someone please help me with this?

Thanks,
Srini

-- 
http://cheyuta-helpinghands.blogspot.com


Multiple custom Similarity implementations

2016-03-07 Thread Parvesh Garg
Hi,

We have a requirement to run an A/B test over multiple Similarity
implementations. Is it possible to define multiple similarity
tags in the schema.xml file and choose one via a URL parameter? We are using
Solr 4.7.

Currently, we are planning to have different cores with different
similarities configured and to split traffic based on core names. This
leads to index duplication and unnecessary resource usage.

Any help is highly appreciated.

Parvesh Garg,

http://www.zettata.com


Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
What do you mean "the rest of the cluster"? The routing is based on
the key provided. All of the "enu" prefixes will go to one of your
shards. All the "deu" docs will appear on one shard. All the "esp"
will be on one shard. All the "chs" docs will be on one shard.
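That prefix-to-shard behavior can be sketched roughly as follows (illustrative only: Solr's CompositeIdRouter actually uses MurmurHash3 and combines bits of hash(prefix) with bits of hash(suffix); a generic hash of the prefix alone is enough to show why every doc sharing a prefix co-locates):

```python
import hashlib

# Two shards, each owning half of the 32-bit hash ring (as in the question).
SHARD_RANGES = [(0x00000000, 0x7FFFFFFF), (0x80000000, 0xFFFFFFFF)]

def shard_for(doc_id):
    """Hash the routing prefix (the part before '!') and map it onto the
    ring; every ID with the same prefix lands on the same shard."""
    prefix = doc_id.split("!", 1)[0]
    h = int.from_bytes(hashlib.md5(prefix.encode()).digest()[:4], "big")
    return next(i for i, (lo, hi) in enumerate(SHARD_RANGES) if lo <= h <= hi)

ids = ["enu!doc1", "enu!doc2", "deu!doc3", "deu!doc4", "esp!doc5", "chs!doc6"]
print({i: shard_for(i) for i in ids})
```

Note that nothing forces the four prefixes to spread evenly over two shards; with so few distinct prefixes, all of them can legitimately hash into the same range, which is exactly the skew risk described above.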

Which shard will each go to? Good question. Especially when you have
small numbers of keys and/or one of the keys has a majority of your
corpus you can end up with very uneven distributions. If you require
individual control, what I'd do is create separate _collections_ for
each language, then use collection aliasing to have a single URL to
query. Of course that requires that you index to the correct
collection... You could also create a collection for the language
with the most docs and one for "everything else". Or...

The advantage here is that the collection can be tailored to the
number of docs. That is, the Spanish collection may be a single shard
whereas the English one may be 4 shards...

But really, with a corpus this size I wouldn't worry about it. I
suspect you're over-thinking the problem.

And one addendum to Walter's comment... I often turn caching off (or
way down) when doing perf testing if I can't mine logs for, say,
100K queries in an attempt to negate the effects of caching, but that
doesn't force swapping, which is that approach's weakness.

I worked with one client that was thrilled at getting < 5ms response
times for their stress tests, with many threads simultaneously
executing queries... except they were firing the exact same query
over and over and over.

Best,
Erick

On Mon, Mar 7, 2016 at 7:36 PM, shamik  wrote:
> Thanks Eric and Walter, this is extremely insightful. One last followup
> question on composite routing. I'm trying to have a better understanding of
> index distribution. If I use language as a prefix, SolrCloud guarantees that
> same language content will be routed to the same shard. What I'm curious to
> know is how rest of the data is being distributed across remaining shards.
> For e.g. I've the following composite keys,
>
> enu!doc1
> enu!doc2
> deu!doc3
> deu!doc4
> esp!doc5
> chs!doc6
>
> If I've 2 shards in the cluster, will SolrCloud try to distribute the above
> data evenly? Is it possible that enu will be routed to shard1 while deu goes
> to shard2, and esp and chs get indexed in either of them? Or, all of them
> can potentially end up getting indexed in the same shard, either 1 or 2,
> leaving one shard under-utilized.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262336.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks Eric and Walter, this is extremely insightful. One last followup
question on composite routing. I'm trying to have a better understanding of
index distribution. If I use language as a prefix, SolrCloud guarantees that
same language content will be routed to the same shard. What I'm curious to
know is how rest of the data is being distributed across remaining shards.
For e.g. I've the following composite keys,

enu!doc1
enu!doc2
deu!doc3
deu!doc4
esp!doc5
chs!doc6

If I've 2 shards in the cluster, will SolrCloud try to distribute the above
data evenly? Is it possible that enu will be routed to shard1 while deu goes
to shard2, and esp and chs get indexed in either of them? Or, all of them
can potentially end up getting indexed in the same shard, either 1 or 2,
leaving one shard under-utilized.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262336.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: High Cpu sys usage

2016-03-07 Thread Shawn Heisey
On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
> How does this relate to YouPeng reporting that the CPU usage increases?
>
> This is not a snark. YouPeng mentions kernel issues. It might very well
> be that IO is the real problem, but that it manifests in a non-intuitive
> way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
> not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
> system is struggling, even if IO-Wait is low?

It might turn out to be not directly related to memory, you're right
about that.  A very high query rate or particularly CPU-heavy queries or
analysis could cause high CPU usage even when memory is plentiful, but
in that situation I would expect high user percentage, not kernel.  I'm
not completely sure what might cause high kernel usage if iowait is low,
but no specific information was given about iowait.  I've seen iowait
percentages of 10% or less with problems clearly caused by iowait.

With the available information (especially seeing 700GB of index data),
I believe that the "not enough memory" scenario is more likely than
anything else.  If the OP replies and says they have plenty of memory,
then we can move on to the less common (IMHO) reasons for high CPU with
a large index.

If the OS is one that reports load average, I am curious what the 5
minute average is, and how many real (non-HT) CPU cores there are.
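On a *nix box, the numbers Shawn asks about can be pulled quickly with Python's standard library (note that os.cpu_count() reports logical CPUs, including hyperthreads, so the physical core count has to come from elsewhere, e.g. lscpu):

```python
import os

# 1-, 5-, and 15-minute load averages, as reported by the kernel
one, five, fifteen = os.getloadavg()
logical = os.cpu_count()  # logical CPUs (includes hyperthreads)

print(f"5-minute load average: {five:.2f} on {logical} logical CPUs")
if five > logical:
    print("Load exceeds logical CPU count: the box is oversubscribed")
```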

Thanks,
Shawn



Re: Solr Cloud sharding strategy

2016-03-07 Thread Walter Underwood
Excellent advice, and I’d like to reinforce a few things.

* Solr indexing is CPU intensive and generates lots of disk IO. Faster CPUs and 
faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile 
latency and that is dominated by rare and malformed queries.
* 5000 queries is not nearly enough. That totally fits in cache. I usually 
start with 100K, though I’d like more. Benchmarking a cached system is one of 
the hardest things in devops.
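One hedged sketch of assembling that 100K-query benchmark set, assuming access logs that carry q= parameters (the log format and parameter name here are assumptions; adjust the pattern for your own logs):

```python
import re
from urllib.parse import unquote_plus

def distinct_queries(log_lines, limit=100_000):
    """Pull distinct q= values out of request-log lines, preserving
    first-seen order so the benchmark replays realistic traffic."""
    pat = re.compile(r"[?&]q=([^&\s]+)")
    seen, out = set(), []
    for line in log_lines:
        m = pat.search(line)
        if not m:
            continue
        q = unquote_plus(m.group(1))
        if q not in seen:
            seen.add(q)
            out.append(q)
            if len(out) >= limit:
                break
    return out

sample = [
    'GET /solr/select?q=solr+cloud&wt=json',
    'GET /solr/select?q=solr+cloud&wt=xml',   # duplicate query text
    'GET /solr/select?q=sharding&rows=10',
]
print(distinct_queries(sample))
```

Deduplicating keeps the cache-friendly repeats from masking real latency, which is the point of Walter's warning.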

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 7, 2016, at 4:27 PM, Erick Erickson  wrote:
> 
> Still, 50M is not excessive for a single shard although it's getting
> into the range that I'd like proof that my hardware etc. is adequate
> before committing to it. I've seen up to 300M docs on a single
> machine, admittedly they were tweets. YMMV based on hardware and index
> complexity of course. Here's a long blog about sizing:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> In this case I'd be pretty comfortable by creating a test harness
> (using jMeter or the like) and faking the extra 30M documents by
> re-indexing the current corpus but assigning new IDs. Keep doing this until your target machine breaks (i.e. either blows up
> by exhausting memory or the response slows unacceptably) and that'll
> give you a good upper bound. Note that you should plan on a couple of
> rounds of tuning/testing when you start to have problems.
> 
> I'll warn you up front, though, that unless you have an existing app
> to mine for _real_ user queries, generating say 5,000 "typical"
> queries is more of a challenge than you might expect ;)...
> 
> Now, all that said all is not lost if you do go with a single shard.
> Let's say that 6 months down the road your requirements change. Or the
> initial estimate was off. Or
> 
> There are a couple of options:
> 1> create a new collection with more shards and re-index from scratch
> 2> use the SPLITSHARD Collections API call to, well, split the shard.
> 
> 
> In this latter case, a shard is split into two pieces of roughly equal
> size, which does mean that you can only grow your shard count by
> powers of 2.
> 
> And even if you do have a single shard, using SolrCloud is still a
> good thing as the failover is automagically handled assuming you have
> more than one replica...
> 
> Best,
> Erick
> 
> On Mon, Mar 7, 2016 at 4:05 PM, shamik  wrote:
>> Thanks a lot, Erick. You are right, it's a tad small with around 20 million
>> documents, but the growth projection around 50 million in next 6-8 months.
>> It'll continue to grow, but maybe not at the same rate. From the index size
>> point of view, the size can grow up to half a TB from its current state.
>> Honestly, my perception of "big" index is still vague :-) . All I'm trying
>> to make sure is that decision I take is scalable in the long term and will
>> be able to sustain the growth without compromising the performance.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
Still, 50M is not excessive for a single shard although it's getting
into the range that I'd like proof that my hardware etc. is adequate
before committing to it. I've seen up to 300M docs on a single
machine, admittedly they were tweets. YMMV based on hardware and index
complexity of course. Here's a long blog about sizing:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

In this case I'd be pretty comfortable by creating a test harness
(using jMeter or the like) and faking the extra 30M documents by
re-indexing the current corpus but assigning new IDs. Keep doing this
until your target machine breaks (i.e. either blows up by exhausting
memory or the response slows unacceptably) and that'll give you a good
upper bound. Note that you should plan on a couple of rounds of
tuning/testing when you start to have problems.

I'll warn you up front, though, that unless you have an existing app
to mine for _real_ user queries, generating say 5,000 "typical"
queries is more of a challenge than you might expect ;)...

Now, all that said all is not lost if you do go with a single shard.
Let's say that 6 months down the road your requirements change. Or the
initial estimate was off. Or...

There are a couple of options:
1> create a new collection with more shards and re-index from scratch
2> use the SPLITSHARD Collections API call to, well, split the shard.


In this latter case, a shard is split into two pieces of roughly equal
size, which does mean that you can only grow your shard count by
powers of 2.

And even if you do have a single shard, using SolrCloud is still a
good thing as the failover is automagically handled assuming you have
more than one replica...

Best,
Erick

On Mon, Mar 7, 2016 at 4:05 PM, shamik  wrote:
> Thanks a lot, Erick. You are right, it's a tad small with around 20 million
> documents, but the growth projection around 50 million in next 6-8 months.
> It'll continue to grow, but maybe not at the same rate. From the index size
> point of view, the size can grow up to half a TB from its current state.
> Honestly, my perception of "big" index is still vague :-) . All I'm trying
> to make sure is that decision I take is scalable in the long term and will
> be able to sustain the growth without compromising the performance.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks a lot, Erick. You are right, it's a tad small with around 20 million
documents, but the growth projection is around 50 million in the next 6-8
months. It'll continue to grow, but maybe not at the same rate. From the
index size point of view, the index can grow up to half a TB from its
current state. Honestly, my perception of a "big" index is still vague :-) .
All I'm trying to make sure is that the decision I take is scalable in the
long term and will be able to sustain the growth without compromising
performance.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
Sent from the Solr - User mailing list archive at Nabble.com.


How can I monitor the jetty thread pool

2016-03-07 Thread Yago Riveiro
Hi,

How can I monitor the jetty thread pool?

I want to build a Zabbix graph with this info, but JMX doesn't show any
entry for it.



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-monitor-the-jetty-thread-pool-tp4262298.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Json API How to escape space in search string

2016-03-07 Thread Jack Krupansky
Backslash in JSON just tells JSON to escape the next character, while what
you really want is to pass a backslash through to the Solr query parser,
which you can do with a double backslash.

Alternatively, you could use quotes around the string in Solr, which would
require you to escape the quotes in JSON, with a single backslash.
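Both options can be verified offline with any JSON parser; a quick Python check (the query string is from the example being discussed):

```python
import json

# Option 1: double backslash in the JSON source. The parser turns the
# two backslashes into one, so Solr's query parser sees an escaped space.
q1 = json.loads('{"query": "string_s:new\\\\ value"}')["query"]
assert q1 == "string_s:new\\ value"

# Option 2: quote the phrase; the inner quotes are escaped at the JSON level.
q2 = json.loads('{"query": "string_s:\\"new value\\""}')["query"]
assert q2 == 'string_s:"new value"'

print(q1, "|", q2)
```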

-- Jack Krupansky

On Mon, Mar 7, 2016 at 5:49 PM, Iana Bondarska  wrote:

> Hi All,
> could you please tell me if escaping special characters in search keywords
> works in json api.
> e.g. I have document
>  {
> "string_s":"new value"
> }
> And I want to query "string_s" field with keyword "new value".
> In path params api I can escape spaces in keyword as well as other special
> characters with \ .
> following query finds document:
> http://localhsot:8983/solr/dynamic_fields_qa/select?q=string_s:new\
> value&wt=json&indent=true
> But if I try to run same search using json api, nothing is found:
>
> http://localhsot:8983/solr/dynamic_fields_qa/select?q=*:*&json=
> {"query":"string_s:new\ value"}
>
> Best Regards,
> Iana Bondarska
>


Re: Solr Json API How to escape space in search string

2016-03-07 Thread Yonik Seeley
On Mon, Mar 7, 2016 at 5:49 PM, Iana Bondarska  wrote:
> Hi All,
> could you please tell me if escaping special characters in search keywords
> works in json api.
> e.g. I have document
>  {
> "string_s":"new value"
> }
> And I want to query "string_s" field with keyword "new value".
> In path params api I can escape spaces in keyword as well as other special
> characters with \ .
> following query finds document:
> http://localhsot:8983/solr/dynamic_fields_qa/select?q=string_s:new\
> value&wt=json&indent=true
> But if I try to run same search using json api, nothing is found:
>
> http://localhsot:8983/solr/dynamic_fields_qa/select?q=*:*&json=
> {"query":"string_s:new\ value"}

So the issue here is probably double-decoding... the JSON parser will
see the backslash-space and replace it with space only, and then the
lucene query parser will not see the backslash at all.

Either
1) use a double backslash
2) enclose the string in quotes (which will need to be backslash
escaped at the JSON level, or you can use the json single-quote
support.)

-Yonik


Solr Json API How to escape space in search string

2016-03-07 Thread Iana Bondarska
Hi All,
could you please tell me if escaping special characters in search keywords
works in json api.
e.g. I have document
 {
"string_s":"new value"
}
And I want to query "string_s" field with keyword "new value".
In path params api I can escape spaces in keyword as well as other special
characters with \ .
following query finds document:
http://localhost:8983/solr/dynamic_fields_qa/select?q=string_s:new\
value&wt=json&indent=true
But if I try to run same search using json api, nothing is found:

http://localhost:8983/solr/dynamic_fields_qa/select?q=*:*&json=
{"query":"string_s:new\ value"}

Best Regards,
Iana Bondarska


Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
20M docs is actually a very small collection by the "usual" Solr
standards unless they're _really_ large documents, i.e.
large books.

Actually, I wouldn't even shard to begin with, it's unlikely that it's
necessary and it adds inevitable overhead. If you _must_ shard,
just go with <1>, but again I would be surprised if it was even
necessary.

Best,
Erick

On Mon, Mar 7, 2016 at 2:35 PM, Shamik Bandopadhyay  wrote:
> Hi,
>
>   I'm trying to figure the best way to design/allocate shards for our Solr
> Cloud environment. Our current index has around 20 million documents, in 10
> languages. Around 25-30% of the content is in English. Rest are almost
> equally distributed among the remaining 13 languages. Till now, we had to
> deal with query time deduplication using collapsing parser  for which we
> used multi-level composite routing. But due to that, documents were
> disproportionately distributed across 3 shards. The shard containing the
> duplicate data ended up hosting 80% of the index. For e.g. Shard1 had a
> 30gb index while Shard2 and Shard3 10gb each. The composite key is
> currently made of "language!dedup_id!url" . At query time, we are using
> shard.keys=language/8! for three level routing.
>
> Due to performance overhead, we decided to move the de-duplication logic
> during index time which made the composite routing redundant. We are not
> discarding the duplicate content so there's no change in index size. Before
> I update the routing key, just wanted to check what will be the best
> approach to the sharding architecture so that we get optimal performance.
> We currently have 3 shards with 2 replicas each. The entire index resides
> in one single collection. What I'm trying to understand is whether:
>
> 1. We let Solr use simple document routing based on id and route the
> documents to any of the 3 shards
> 2. We create a composite id using language, e.g. language!unique_id and
> make sure that the same language content will always be in same the shard.
> What I'm not sure is whether the index will be equally distributed across
> the three shards.
> 3. Index English only content to a dedicated shard, rest equally
> distributed to the remaining two. I'm not sure if that's possible.
> 4. Create a dedicated collection for English and one for rest of the
> languages.
>
> Any pointers on this will be highly appreciated.
>
> Regards,
> Shamik


Re: Solrcloud Batch Indexing

2016-03-07 Thread Erick Erickson
Bin:

The MRIT/Morphlines only makes sense if you have lots more
nodes devoted to the M/R jobs than you do Solr shards since the
actual work done to index a given doc is exactly the same either
with MRIT/Morphlines or just sending straight to Solr.

A bit of background here. I mentioned that MRIT/Morphlines uses
EmbeddedSolrServer. This is exactly Solr as far as the actual indexing
is concerned. So using --go-live is not buying you anything and, in fact,
is costing you quite a bit over just using <2> to index directly to Solr since
the index has to be copied around. I confess I'm surprised that --go-live
is taking that long. Basically it's just copying your index up to Solr, so
perhaps there's an I/O problem or some such.

OK, I'm lying a little bit here, _if_ you have more than one replica per
shard, then indexing straight to Solr will cost you (anecdotally)
10-15% in indexing speed. But if this is a single replica/shard (i.e.
leader-only), then it's near enough to being the exact same.

Anyway, at the end of the day, the index produced is self-contained.
You could even just copy it to your shards (with Solr down), and then
bring up your Solr nodes on a non-HDFS-based Solr.

But frankly I'd avoid that and benchmark on <2> first. My expectation
is that you'll be fine there and see indexing roughly on par with your
MRIT/Morphlines.

Now, all that said, indexing 300M docs in 'a few minutes' is a bit surprising.
I'm really wondering if you're not being fooled by something "odd". Have
you compared the identical runs with and without --go-live?

_Very_ often, the bottleneck isn't Solr at all, it's the data acquisition, so be
careful when measuring that the Solr CPU's are pegged... otherwise
you're bottlenecking upstream of Solr. A super-simple way to figure that
out is to comment out the solrServer.add(list, 1) line in <2> or just
run MRIT/Morphlines without the --go-live switch.

BTW, with <2> you could run with as many jobs as you wanted to run
the Solr servers flat-out.
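The batching idea behind <2> can be sketched over plain HTTP as below (URL and collection name are placeholders; SolrJ's CloudSolrServer, as in the blog post, is the better client since it routes documents straight to shard leaders):

```python
import json
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"  # placeholder

def chunked(docs, size=1000):
    """Yield successive batches of up to `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def index_all(docs, size=1000):
    """POST each batch to Solr's JSON update handler; batching
    amortizes per-request HTTP overhead, which is usually the
    difference between slow and fast bulk loads."""
    for batch in chunked(docs, size):
        req = urllib.request.Request(
            SOLR_UPDATE_URL,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()
```

Running several such loaders in parallel is the "as many jobs as you wanted" point above: the bottleneck test is simply whether the Solr CPUs are pegged while they run.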

FWIW,
Erick

On Mon, Mar 7, 2016 at 1:14 PM, Bin Wang  wrote:
> Hi Eric,
>
> Thanks for your quick response.
>
> From the data's perspective, we have 300+ million rows and believe it or
> not, the source data is from relational database (Hive) and the database is
> rebuilt every day (I am as frustrated as most of you who read this but it
> is what it is) and potentially need to store actually all of the fields.
> In this case, I have to figure out a solution to quickly index 300+ million
> rows as fast as I can.
>
> I am still at a stage evaluating all the different solutions, and I am
> sorry that I haven't really benchmarked the second approach yet.
> I will find a time to run some benchmark and share the result with the
> community.
>
> Regarding the approach that I suggested - mapreduce Lucene indexes, do you
> think it is feasible and does that worth the effort to dive into?
>
> Best regards,
>
> Bin
>
>
>
> On Mon, Mar 7, 2016 at 1:57 PM, Erick Erickson 
> wrote:
>
>> I'm wondering if you need map reduce at all ;)...
>>
>> The achilles heel with M/R viz: Solr is all the copying around
>> that's done at the end of the cycle. For really large bulk indexing
>> jobs, that's a reasonable price to pay..
>>
>> How many docs and how would you characterize them as far
>> as size, fields, etc? And what are your time requirements? What
>> kind of docs?
>>
>> I'm thinking this may be an "XY Problem". You're asking about
>> a specific solution before explaining the problem.
>>
>> Why do you say that Solr is not really optimized for bulk loading?
>> I took a quick look at <2> and the approach is sound. It batches
>> up the docs in groups of 1,000 and uses CloudSolrServer as it should.
>> Have you tried it? At the end of the day, MapReduceIndexerTool does
>> the same work to index a doc as a regular Solr server would via
>> EmbeddedSolrServer so if the number of tasks you have running is
>> roughly equal to the number of shards, it _should_ be roughly
>> comparable.
>>
>> Still, though, I have to repeat my question about how many docs you're
>> talking here. Using M/R inevitably adds complexity, what are you trying
>> to gain here that you can't get with several threads in a SolrJ client?
>>
>> Best,
>> Erick
>>
>> On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang  wrote:
>> > Hi there,
>> >
>> > I have a fairly big data set that I need to quick index into Solrcloud.
>> >
>> > I have done some research and none of them looked really good to me.
>> >
>> > (1) Kite Morphline: I managed to get it working, the mapreduce finished
>> in
>> > a few minutes which is good, however, it took a really long time, like
>> one
>> > hour (60 million), to merge the indexes into Solrcloud, the go-live part.
>> >
>> > (2) Mapreduce Using Solrcloud Server:
>> > <
>> http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html
>> >
>> > this
>> > approach is pretty straightforward, however, every document has to funnel
> > through the solrserver which is really not optimized for bulk loading.

Solr Cloud sharding strategy

2016-03-07 Thread Shamik Bandopadhyay
Hi,

  I'm trying to figure the best way to design/allocate shards for our Solr
Cloud environment. Our current index has around 20 million documents, in 10
languages. Around 25-30% of the content is in English. Rest are almost
equally distributed among the remaining 13 languages. Till now, we had to
deal with query time deduplication using collapsing parser  for which we
used multi-level composite routing. But due to that, documents were
disproportionately distributed across 3 shards. The shard containing the
duplicate data ended up hosting 80% of the index. For e.g. Shard1 had a
30gb index while Shard2 and Shard3 10gb each. The composite key is
currently made of "language!dedup_id!url" . At query time, we are using
shard.keys=language/8! for three level routing.

Due to performance overhead, we decided to move the de-duplication logic
during index time which made the composite routing redundant. We are not
discarding the duplicate content so there's no change in index size. Before
I update the routing key, just wanted to check what will be the best
approach to the sharding architecture so that we get optimal performance.
We currently have 3 shards with 2 replicas each. The entire index resides
in one single collection. What I'm trying to understand is whether:

1. We let Solr use simple document routing based on id and route the
documents to any of the 3 shards
2. We create a composite id using language, e.g. language!unique_id and
make sure that the same language content will always be in same the shard.
What I'm not sure is whether the index will be equally distributed across
the three shards.
3. Index English only content to a dedicated shard, rest equally
distributed to the remaining two. I'm not sure if that's possible.
4. Create a dedicated collection for English and one for rest of the
languages.

Any pointers on this will be highly appreciated.

Regards,
Shamik


Warning and Error messages in Solr's log

2016-03-07 Thread Steven White
Hi folks,

In Solr's solr-8983-console.log I see the following (about 50 in a span of
24 hours when index is on going):

WARNING: Couldn't flush user prefs:
java.util.prefs.BackingStoreException: Couldn't get file lock.

What does it mean? Should I worry about it?

What about this one:

118316292 [qtp114794915-39] ERROR org.apache.solr.core.SolrCore  [
test_idx] ? java.lang.IllegalStateException: file:
MMapDirectory@/b/vin291f1/vol/vin291f1v3/idx/solr_index/test/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@169f6ad3 appears
both in delegate and in cache: cache=[_2omj.fnm, _2omg_Lucene50_0.doc,
 _2omg.nvm],delegate=[write.lock, _1wuk.si,  segments_2b]
at
org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:103)

What does it mean?

I _think_ the error log is due to the NAS drive being disconnected before
shutting down Solr, but I need a Solr expert to confirm.

Unfortunately, I cannot find anything in solr.log files regarding this
because those files have rotated.

Thanks in advance.

Steve


Re: Solrcloud Batch Indexing

2016-03-07 Thread Bin Wang
Hi Eric,

Thanks for your quick response.

From the data's perspective, we have 300+ million rows and, believe it or
not, the source data is from a relational database (Hive) that is rebuilt
every day (I am as frustrated as most of you who read this, but it is what
it is), and we potentially need to store all of the fields. In this case, I
have to figure out a solution to index 300+ million rows as fast as I can.

I am still at a stage evaluating all the different solutions, and I am
sorry that I haven't really benchmarked the second approach yet.
I will find a time to run some benchmark and share the result with the
community.

Regarding the approach that I suggested - mapreduce Lucene indexes, do you
think it is feasible, and is it worth the effort to dive into?

Best regards,

Bin



On Mon, Mar 7, 2016 at 1:57 PM, Erick Erickson 
wrote:

> I'm wondering if you need map reduce at all ;)...
>
> The achilles heel with M/R viz: Solr is all the copying around
> that's done at the end of the cycle. For really large bulk indexing
> jobs, that's a reasonable price to pay..
>
> How many docs and how would you characterize them as far
> as size, fields, etc? And what are your time requirements? What
> kind of docs?
>
> I'm thinking this may be an "XY Problem". You're asking about
> a specific solution before explaining the problem.
>
> Why do you say that Solr is not really optimized for bulk loading?
> I took a quick look at <2> and the approach is sound. It batches
> up the docs in groups of 1,000 and uses CloudSolrServer as it should.
> Have you tried it? At the end of the day, MapReduceIndexerTool does
> the same work to index a doc as a regular Solr server would via
> EmbeddedSolrServer so if the number of tasks you have running is
> roughly equal to the number of shards, it _should_ be roughly
> comparable.
>
> Still, though, I have to repeat my question about how many docs you're
> talking here. Using M/R inevitably adds complexity, what are you trying
> to gain here that you can't get with several threads in a SolrJ client?
>
> Best,
> Erick
>
> On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang  wrote:
> > Hi there,
> >
> > I have a fairly big data set that I need to quick index into Solrcloud.
> >
> > I have done some research and none of them looked really good to me.
> >
> > (1) Kite Morphline: I managed to get it working, the mapreduce finished
> in
> > a few minutes which is good, however, it took a really long time, like
> one
> > hour (60 million), to merge the indexes into Solrcloud, the go-live part.
> >
> > (2) Mapreduce Using Solrcloud Server:
> > <
> http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html
> >
> > this
> > approach is pretty straightforward, however, every document has to funnel
> > through the solrserver which is really not optimized for bulk loading.
> >
> > Here is what I am thinking, is it possible to use Mapreduce to create a
> few
> > Lucene indexes first, for example, using 3 reducers to write three
> indexes.
> > Then create a Solr collection with three shards pointing to the generated
> > indexes. Can Solr easily pick up generated indexes?
> >
> > I am really new to Solr and wondering if this is feasible, and if there
> is
> > any work that has already been done. I am not really interested in
> cutting
> > the edge and any existing work should be appreciated!
> >
> > Best regards,
> >
> > Bin
>


Re: Solrcloud Batch Indexing

2016-03-07 Thread Erick Erickson
I'm wondering if you need map reduce at all ;)...

The achilles heel with M/R viz: Solr is all the copying around
that's done at the end of the cycle. For really large bulk indexing
jobs, that's a reasonable price to pay..

How many docs and how would you characterize them as far
as size, fields, etc? And what are your time requirements? What
kind of docs?

I'm thinking this may be an "XY Problem". You're asking about
a specific solution before explaining the problem.

Why do you say that Solr is not really optimized for bulk loading?
I took a quick look at <2> and the approach is sound. It batches
up the docs in groups of 1,000 and uses CloudSolrServer as it should.
Have you tried it? At the end of the day, MapReduceIndexerTool does
the same work to index a doc as a regular Solr server would via
EmbeddedSolrServer so if the number of tasks you have running is
roughly equal to the number of shards, it _should_ be roughly
comparable.

Still, though, I have to repeat my question about how many docs you're
talking here. Using M/R inevitably adds complexity, what are you trying
to gain here that you can't get with several threads in a SolrJ client?

Best,
Erick

On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang  wrote:
> Hi there,
>
> I have a fairly big data set that I need to quickly index into Solrcloud.
>
> I have done some research and none of them looked really good to me.
>
> (1) Kite Morphline: I managed to get it working, the mapreduce finished in
> a few minutes which is good, however, it took a really long time, like one
> hour (60 million), to merge the indexes into Solrcloud, the go-live part.
>
> (2) Mapreduce Using Solrcloud Server:
> 
> this
> approach is pretty straightforward, however, every document has to funnel
> through the solrserver which is really not optimized for bulk loading.
>
> Here is what I am thinking, is it possible to use Mapreduce to create a few
> Lucene indexes first, for example, using 3 reducers to write three indexes.
> Then create a Solr collection with three shards pointing to the generated
> indexes. Can Solr easily pick up generated indexes?
>
> I am really new to Solr and wondering if this is feasible, and if there is
> any work that has already been done. I am not really interested in cutting
> the edge and any existing work should be appreciated!
>
> Best regards,
>
> Bin
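
For what it's worth, the "several threads in a SolrJ client" approach Erick suggests reduces structurally to batching documents and handing batches to a small thread pool. Below is a minimal, stdlib-only sketch of that shape; the sendBatch stub stands in for CloudSolrServer/CloudSolrClient.add() against a real cluster, and all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BulkIndexSketch {
    static final int BATCH_SIZE = 1000;
    static final AtomicInteger batchesSent = new AtomicInteger();

    // Stand-in for client.add(batch); a real indexer would send the batch
    // to Solr here and commit periodically. We only count batches.
    static void sendBatch(List<String> batch) {
        batchesSent.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 10_500; i++) {           // 10,500 fake "documents"
            batch.add("doc-" + i);
            if (batch.size() == BATCH_SIZE) {
                final List<String> toSend = batch;   // hand off full batch
                pool.execute(() -> sendBatch(toSend));
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            final List<String> toSend = batch;       // flush the remainder
            pool.execute(() -> sendBatch(toSend));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(batchesSent.get());       // 10 full batches + 1 partial
    }
}
```

Sizing the pool to roughly the number of shards, as Erick notes, keeps all shard leaders busy without oversubscribing them.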


Re: Solr Deserialize/Read .fdt file

2016-03-07 Thread Bin Wang
Hi Jack,

Thanks a lot for your response.

I agree the question goes too deep into Lucene, which is outside the scope of
Solr.

However, for those of you who are interested in understanding the Solr index
more, here are a few resources to help:
(1) Luke :  a local application that
helps you inspect Lucene indexes.
(2) SimpleTextCodec
: it is not
that easy to deserialize Lucene indexes; however, you can change the
default codec to SimpleTextCodec, and that will help you understand
the index/search internals much better.

Best regards,

Bin

On Sun, Mar 6, 2016 at 1:49 PM, Jack Krupansky 
wrote:

> Solr itself doesn't directly access index files - that is the
> responsibility of Lucene. That's why you see "lucene" in the class names,
> not "solr".
>
> To be clear, no Solr user will ever have to read or deserialize a .fdt
> file. Or any Lucene index file for that matter.
>
> If you actually do want to work at the Lucene level (which no one here will
> recommend), start with the Lucene doc:
>
> https://lucene.apache.org/core/documentation.html
> https://lucene.apache.org/core/5_5_0/index.html
>
> For File Formats:
>
> https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/codecs/lucene54/package-summary.html#package_description
>
> After that you will need to become much more familiar with the Lucene (not
> Solr) source code.
>
> If you want to trace through the code from Solr through Lucene, I suggest
> you start with Solr unit tests in Eclipse.
>
> But none of that will be an appropriate topic for users on this (Solr)
> list.
>
>
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 3:34 PM, Bin Wang  wrote:
>
> > Hi there, I am interested in understanding all the files in the index
> > folder.
> >
> > here 
> > is
> > a stackoverflow question that I have tried however failed.
> >
> > Can anyone provide some sample code to help me get started.
> >
> > Best regards,
> > Bin
> >
>


Solrcloud Batch Indexing

2016-03-07 Thread Bin Wang
Hi there,

I have a fairly big data set that I need to quickly index into Solrcloud.

I have done some research and none of them looked really good to me.

(1) Kite Morphline: I managed to get it working; the mapreduce finished in
a few minutes, which is good. However, it took a really long time, like one
hour (60 million), to merge the indexes into Solrcloud, the go-live part.

(2) Mapreduce Using Solrcloud Server:

this
approach is pretty straightforward; however, every document has to funnel
through the Solr server, which is really not optimized for bulk loading.

Here is what I am thinking, is it possible to use Mapreduce to create a few
Lucene indexes first, for example, using 3 reducers to write three indexes.
Then create a Solr collection with three shards pointing to the generated
indexes. Can Solr easily pick up generated indexes?

I am really new to Solr and wondering if this is feasible, and if there is
any work that has already been done. I am not really interested in being on
the cutting edge, and any pointers to existing work would be appreciated!

Best regards,

Bin


Re: Custom field using PatternCaptureGroupFilterFactory

2016-03-07 Thread Jack Krupansky
Great. And you shouldn't need the "{1}" - the square brackets match a
single character by definition.

-- Jack Krupansky

On Mon, Mar 7, 2016 at 12:20 PM, Jay Potharaju 
wrote:

> Thanks Jack, the problem was my regex. The following regex worked.
>  "([a-zA-Z0-9]{1})" preserve_original="false"/>
> Jay
>
> On Sun, Mar 6, 2016 at 7:43 PM, Jack Krupansky 
> wrote:
>
> > The filter name, "Capture Group", says it all - only pattern groups are
> > captured and you have not specified even a single group. See the example:
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
> >
> > Groups are each enclosed within parentheses, as shown in the Javadoc
> > example above.
> >
> > Since no groups were found, the filter doc applied this rule:
> > "If none of the patterns match, or if preserveOriginal is true, the
> > original token will be preserved."
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
> >
> > That should probably also say "or if no pattern groups match".
> >
> > To test regular expressions, try an interactive online tool, such as:
> > https://regex101.com/
> >
> > -- Jack Krupansky
> >
> > On Sun, Mar 6, 2016 at 7:51 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> > > I don't see the brackets that mark the group you actually want to
> > > capture. As per:
> > >
> > >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
> > >
> > > I am also not sure if you actually need "{0,1}" part.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Newsletter and resources for Solr beginners and intermediates:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 7 March 2016 at 04:25, Jay Potharaju  wrote:
> > > > Hi,
> > > > I have a custom field for getting the first letter of a firstname.
> For
> > > > this I am using PatternCaptureGroupFilterFactory.
> > > > This is not working as expected, not able to parse the data and get
> the
> > > > first character for the string. Any suggestions on how to fix this?
> > > >
> > > >  
> > > >
> > > >   
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > >  pattern=
> > > > "^[a-zA-Z0-9]{0,1}" preserve_original="false"/>
> > > >
> > > >
> > > >
> > > > 
> > > >
> > > > --
> > > > Thanks
> > > > Jay
> > >
> >
>
>
>
> --
> Thanks
> Jay Potharaju
>
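
The behavior Jack describes can be illustrated with plain java.util.regex (which PatternCaptureGroupTokenFilter is built on): the filter only emits captured groups, and a pattern with no parentheses has no groups at all, so the original token is kept. A small sketch with an illustrative input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {
    public static void main(String[] args) {
        // Jay's original pattern had no parentheses, so there is no
        // capture group for the filter to emit: groupCount() is 0.
        Pattern noGroup = Pattern.compile("^[a-zA-Z0-9]");
        System.out.println(noGroup.matcher("jay").groupCount());

        // With parentheses, group 1 captures the first character,
        // mirroring the "([a-zA-Z0-9])" pattern that worked.
        Pattern withGroup = Pattern.compile("([a-zA-Z0-9])");
        Matcher m = withGroup.matcher("jay");
        if (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
```

This prints 0 and then j, confirming that only the parenthesized version exposes the first character as a group.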


Re: Stopping Solr JVM on OOM

2016-03-07 Thread Shawn Heisey
On 2/25/2016 2:06 PM, Fuad Efendi wrote:
> The best practice: do not ever try to catch Throwable or its descendants 
> Error, VirtualMachineError, OutOfMemoryError, and etc. 
>
> Never ever.
>
> Also, do not swallow InterruptedException in a loop.
>
> Few simple rules to avoid hanging application. If we follow these, there will 
> be no question "what is the best way to stop Solr when it gets in OOM” (or 
> just becomes irresponsive because of swallowed exceptions)

As I understand from SOLR-8539, if an OOM is thrown by a Java program
and there is a properly configured OOM script, regardless of what
happens with exception rewrapping, the script *should* kick in.  Here's
an issue where this behavior was verified by a Jetty developer on a
small-scale test program which catches and swallows the OOM:

https://issues.apache.org/jira/browse/SOLR-8539

Solr 5.x, when started on Linux/UNIX systems with the included shell
scripts, comes default with an "oom killer" script that is supposed to
stop Solr when OOM occurs.

Recently it was discovered that the OnOutOfMemoryError option in the
start script for Linux/UNIX was being incorrectly specified on the
command line -- it doesn't actually work.  Here's the issue for that
problem:

https://issues.apache.org/jira/browse/SOLR-8145

The fix for the incorrect OnOutOfMemoryError usage will be in version
6.0 when that version is finally released, which I think will make the
OOM killer actually work on Linux/UNIX.  There is currently no concrete
information on when 6.0 is expected.  If any plans for future 5.x
versions come up, that fix will likely make it into those versions as well.

There is no OOM killer script for Windows, so this feature is not
present when running on Windows.  If somebody can come up with a way for
Windows to find and kill the Solr process, I'd be happy to include it.

Thanks,
Shawn



Re: Custom field using PatternCaptureGroupFilterFactory

2016-03-07 Thread Jay Potharaju
Thanks Jack, the problem was my regex. The following regex worked.

Jay

On Sun, Mar 6, 2016 at 7:43 PM, Jack Krupansky 
wrote:

> The filter name, "Capture Group", says it all - only pattern groups are
> captured and you have not specified even a single group. See the example:
>
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
>
> Groups are each enclosed within parentheses, as shown in the Javadoc
> example above.
>
> Since no groups were found, the filter doc applied this rule:
> "If none of the patterns match, or if preserveOriginal is true, the
> original token will be preserved."
>
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
>
> That should probably also say "or if no pattern groups match".
>
> To test regular expressions, try an interactive online tool, such as:
> https://regex101.com/
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 7:51 PM, Alexandre Rafalovitch 
> wrote:
>
> > I don't see the brackets that mark the group you actually want to
> > capture. As per:
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
> >
> > I am also not sure if you actually need "{0,1}" part.
> >
> > Regards,
> >Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 7 March 2016 at 04:25, Jay Potharaju  wrote:
> > > Hi,
> > > I have a custom field for getting the first letter of a firstname. For
> > > this I am using PatternCaptureGroupFilterFactory.
> > > This is not working as expected, not able to parse the data and get the
> > > first character for the string. Any suggestions on how to fix this?
> > >
> > >  
> > >
> > >   
> > >
> > > 
> > >
> > > 
> > >
> > >  > > "^[a-zA-Z0-9]{0,1}" preserve_original="false"/>
> > >
> > >
> > >
> > > 
> > >
> > > --
> > > Thanks
> > > Jay
> >
>



-- 
Thanks
Jay Potharaju


Re: Stopping Solr JVM on OOM

2016-03-07 Thread Muhammad Zahid Iqbal
You can use the ping functionality with a time-out that suits your
container/web-apps; if the ping fails, you can restart your container.
Cheers!

If there is any other solution, I am interested too.

On Fri, Feb 26, 2016 at 2:19 AM, CP Mishra  wrote:

> Solr & Lucene dev folks must be catching Throwable for a reason. Anyway, I
> am asking for solutions that I can use.
>
> On Thu, Feb 25, 2016 at 3:06 PM, Fuad Efendi  wrote:
>
> > The best practice: do not ever try to catch Throwable or its descendants
> > Error, VirtualMachineError, OutOfMemoryError, and etc.
> >
> > Never ever.
> >
> > Also, do not swallow InterruptedException in a loop.
> >
> > Few simple rules to avoid hanging application. If we follow these, there
> > will be no question "what is the best way to stop Solr when it gets in
> OOM”
> > (or just becomes irresponsive because of swallowed exceptions)
> >
> >
> > --
> > Fuad Efendi
> > 416-993-2060(cell)
> >
> > On February 25, 2016 at 2:37:45 PM, CP Mishra (mishr...@gmail.com)
> wrote:
> >
> > Looking at the previous threads (and in our tests), oom script specified
> > at
> > command line does not work as OOM exception is trapped and converted to
> > RuntimeException. So, what is the best way to stop Solr when it gets in
> > OOM
> > state? The only way I see is to override multiple handlers and do
> > System.exit() from there. Is there a better way?
> >
> > We are using Solr with default Jetty container.
> >
> > Thanks,
> > CP Mishra
> >
> >
>


Re: Text search NGram

2016-03-07 Thread Jack Krupansky
Absolutely, but so what? Nothing in any Solr query is going to be based on
character position.

Also, adding and removing characters in a char filter is a really bad idea
if you might want to do highlighting since the token character position
would not line up with the original source text.

-- Jack Krupansky
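
Jack's point is easy to check with a rough stdlib analogy to the whitespace tokenizer: splitting on runs of whitespace yields the same tokens whether one space or three separates the words; only the character offsets differ, and queries match on tokens and positions, not offsets.

```java
import java.util.Arrays;

public class WhitespaceDemo {
    public static void main(String[] args) {
        // Same tokens come out regardless of how many spaces separate them;
        // only the character offsets (start/end) of "office" would differ.
        String one = "microsoft office";
        String two = "microsoft   office";
        System.out.println(Arrays.toString(one.trim().split("\\s+")));
        System.out.println(Arrays.toString(two.trim().split("\\s+")));
    }
}
```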

On Mon, Mar 7, 2016 at 10:33 AM, G, Rajesh  wrote:

> Hi Jack,
>
>
>
> Please correct me if I am wrong; I added the char filter because of the following:
>
> In the Analyzer (Solr UI) I provided "Microsoft office" as the Field Value
> (Index); WhitespaceTokenizerFactory produces the result below, where "office"
> starts at 10. If I leave additional space, say 2 more spaces, "office" starts
> at 12. Should it not still start at 10?
>
>
>
> text       raw_bytes                     start  end  positionLength  type  position
> microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
> office     [6f 66 66 69 63 65]           10     16   1               word  2
>
> text       raw_bytes                     start  end  positionLength  type  position
> microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
> office     [6f 66 66 69 63 65]           12     18   1               word  2
>
> Thanks
>
> Rajesh
>
>
>
>
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..
>
>
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
>
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Monday, March 7, 2016 8:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Text search NGram
>
>
>
> The charFilter isn't doing anything useful - the whitespace tokenizer
> will ignore extra white space anyway.
>
>
>
> -- Jack Krupansky
>
>
>
> On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh  r...@cebglobal.com>> wrote:
>
>
>
> > Hi Team,
>
> >
>
> > We have the below type and we have indexed the value  "title":
>
> > "Microsoft Visual Studio 2006" and "title": "Microsoft Visual Studio
>
> > 8.0.61205.56 (2005)"
>
> >
>
> > When I search for title:(Microsoft Visual AND Studio AND 2005)  I get
>
> > Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
>
> > Microsoft Visual Studio 2006 as first record. I wanted to have
>
> > Microsoft Visual Studio 8.0.61205.56 (2005) listed first since the
>
> > user has searched for Microsoft Visual Studio 2005. Can you please help?
>
> >
>
> > We are using NGram so it takes care of misspelled or jumbled words[it
>
> > works as expected] e.g.
>
> > searching Micrs Visual Studio will gets Microsoft Visual Studio
>
> > searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
> >
>
> >   
> > positionIncrementGap="0" >
>
> > 
>
> > 
> > class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement="
> "/>
>
> > 
> > class="solr.WhitespaceTokenizerFactory"/>
>
> > 
> > class="solr.LowerCaseFilterFactory"/>
>
> > 
> > minGramSize="2" maxGramSize="800"/>
>
> > 
>
> >  
>
> > 
> > class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement="
> "/>
>
> > 
> > class="solr.WhitespaceTokenizerFactory"/>
>
> > 
> > class="solr.LowerCaseFilterFactory"/>
>
> > 
> > minGramSize="2" maxGramSize="800"/>
>
> > 
>
> >   
>
> >
>
> >
>
> >
>
> > Corporate Executive Board India Private Limited. Registration No:
>
> > U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF
>
> > Building
>
> > No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..
>
> >
>
> >
>
> >
>
> > This e-mail and/or its attachments are intended only for the use of
>
> > the
>
> > addressee(s) and may contain confidential and legally privileged
>
> > information belonging to CEB and/or its subsidiaries, including CEB
>
> > subsidiaries that offer SHL Talent Measurement products and services.
>
> > If you have received this e-mail in error, please notify the sender
>
> > and immediately, destroy all copies of this email and its attachments.
>
> > The p

RE: Text search NGram

2016-03-07 Thread G, Rajesh
Hi Jack,



Please correct me if I am wrong; I added the char filter because of the following:

In the Analyzer (Solr UI) I provided "Microsoft office" as the Field Value
(Index); WhitespaceTokenizerFactory produces the result below, where "office"
starts at 10. If I leave additional space, say 2 more spaces, "office" starts
at 12. Should it not still start at 10?



text       raw_bytes                     start  end  positionLength  type  position
microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
office     [6f 66 66 69 63 65]           10     16   1               word  2

text       raw_bytes                     start  end  positionLength  type  position
microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
office     [6f 66 66 69 63 65]           12     18   1               word  2






Thanks

Rajesh








-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Monday, March 7, 2016 8:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram



The charFilter isn't doing anything useful - the whitespace tokenizer will
ignore extra white space anyway.



-- Jack Krupansky



On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh 
mailto:r...@cebglobal.com>> wrote:



> Hi Team,

>

> We have the below type and we have indexed the value  "title":

> "Microsoft Visual Studio 2006" and "title": "Microsoft Visual Studio

> 8.0.61205.56 (2005)"

>

> When I search for title:(Microsoft Visual AND Studio AND 2005)  I get

> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and

> Microsoft Visual Studio 2006 as first record. I wanted to have

> Microsoft Visual Studio 8.0.61205.56 (2005) listed first since the

> user has searched for Microsoft Visual Studio 2005. Can you please help?

>

> We are using NGram so it takes care of misspelled or jumbled words[it

> works as expected] e.g.

> searching Micrs Visual Studio will gets Microsoft Visual Studio

> searching Visual Microsoft Studio will gets Microsoft Visual Studio

>

>positionIncrementGap="0" >

> 

>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>

>  class="solr.WhitespaceTokenizerFactory"/>

>  class="solr.LowerCaseFilterFactory"/>

>  minGramSize="2" maxGramSize="800"/>

> 

>  

>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>

>  class="solr.WhitespaceTokenizerFactory"/>

>  class="solr.LowerCaseFilterFactory"/>

>  minGramSize="2" maxGramSize="800"/>

> 

>   

>

>

>

> Corporate Executive Board India Private Limited. Registration No:

> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF

> Building

> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..

>

>

>

> This e-mail and/or its attachments are intended only for the use of

> the

> addressee(s) and may contain confidential and legally privileged

> information belonging to CEB and/or its subsidiaries, including CEB

> subsidiaries that offer SHL Talent Measurement products and services.

> If you have received this e-mail in error, please notify the sender

> and immediately, destroy all copies of this email and its attachments.

> The publication, copying, in whole or in part, or use or dissemination

> in any other way of this e-mail and attachments by anyone other than

> the intended

> person(s) is prohibited.

>

>

>


Re: Text search NGram

2016-03-07 Thread Jack Krupansky
The charFilter isn't doing anything useful - the whitespace tokenizer will
ignore extra white space anyway.

-- Jack Krupansky

On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh  wrote:

> Hi Team,
>
> We have the below type and we have indexed the value  "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56
> (2005)"
>
> When I search for title:(Microsoft Visual AND Studio AND 2005)  I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as first record. I wanted to have Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first since the user has searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so it takes care of misspelled or jumbled words[it
> works as expected]
> e.g.
> searching Micrs Visual Studio will gets Microsoft Visual Studio
> searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
>positionIncrementGap="0" >
> 
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>  
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>   
>
>
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..
>
>
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
>
>


RE: Text search NGram

2016-03-07 Thread G, Rajesh
Hi Emir,

I got it. Thanks Emir it was helpful

Thanks
Rajesh




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 8:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Not sure I understood the question. What I meant is for you to try setting
omitNorms="false" on your txt_token field type if you want to stick with the
ngram-only solution:


 
 
 
 
 
 
  
 
 
 
 
 
   


and to add a new field type and field to keep a non-ngram version of the field.
Something like:


 
 
 
 
 
  
 
 
 
 
   


and use copyField to copy to both fields and query title:test OR 
title_simple:test.

Emir


On 07.03.2016 15:31, G, Rajesh wrote:
> Hi Emir,
>
> I have already applied
>
>  and then I have applied 
> . 
> Is this what you wanted me to have in my config?
>
> Thanks
> Rajesh
>
>
>
> Corporate Executive Board India Private Limited. Registration No: 
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building 
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.
>
> This e-mail and/or its attachments are intended only for the use of the 
> addressee(s) and may contain confidential and legally privileged information 
> belonging to CEB and/or its subsidiaries, including CEB subsidiaries that 
> offer SHL Talent Measurement products and services. If you have received this 
> e-mail in error, please notify the sender and immediately, destroy all copies 
> of this email and its attachments. The publication, copying, in whole or in 
> part, or use or dissemination in any other way of this e-mail and attachments 
> by anyone other than the intended person(s) is prohibited.
>
> -Original Message-
> From: G, Rajesh [mailto:r...@cebglobal.com]
> Sent: Monday, March 7, 2016 7:50 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Text search NGram
>
> Hi Emir,
>
> Thanks for your email. Can you please help me understand what you mean
> by "e.g. boost if matching tokenized fields to make sure exact matches are
> ordered first"?
>
>
>
> Corporate Executive Board India Private Limited. Registration No: 
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building 
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.
>
> This e-mail and/or its attachments are intended only for the use of the 
> addressee(s) and may contain confidential and legally privileged information 
> belonging to CEB and/or its subsidiaries, including CEB subsidiaries that 
> offer SHL Talent Measurement products and services. If you have received this 
> e-mail in error, please notify the sender and immediately, destroy all copies 
> of this email and its attachments. The publication, copying, in whole or in 
> part, or use or dissemination in any other way of this e-mail and attachments 
> by anyone other than the intended person(s) is prohibited.
>
> -Original Message-
> From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
> Sent: Monday, March 7, 2016 7:36 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Text search NGram
>
> Hi Rajesh,
> It is most likely related to norms - you can try setting omitNorms="true" and 
> reindexing content. Anyway, it is not common to use just ngrams for matching 
> content - in such case you can expect more unexpected ordering/results. You 
> should combine ngrams fields with normally tokenized fields (e.g. boost if 
> matching tokenized fields to make sure exact matches are ordered first).
>
> Regards,
> Emir
>
> On 07.03.2016 11:44, G, Rajesh wrote:
>> Hi Team,
>>
>> We have the below type and we have indexed the value  "title": "Microsoft 

Re: Text search NGram

2016-03-07 Thread Emir Arnautovic
Not sure I understood the question. What I meant is for you to try setting
omitNorms="false" on your txt_token field type if you want to stick with
the ngram-only solution:









 





  


and to add a new field type and field to keep a non-ngram version of the field.
Something like:








 




  


and use copyField to copy to both fields and query title:test OR 
title_simple:test.


Emir


On 07.03.2016 15:31, G, Rajesh wrote:

Hi Emir,

I have already applied

 and then I have applied . Is this what you wanted me to 
have in my config?

Thanks
Rajesh




-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Monday, March 7, 2016 7:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Text search NGram

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by "e.g.
boost if matching tokenized fields to make sure exact matches are ordered first"?




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and 
reindexing content. Anyway, it is not common to use just ngrams for matching content - in 
such case you can expect more unexpected ordering/results. You should combine ngrams 
fields with normally tokenized fields (e.g. boost if matching tokenized fields to make 
sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the below type and we have indexed the value  "title": "Microsoft Visual Studio 2006" and 
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)"

When I search for title:(Microsoft Visual AND Studio AND 2005)  I get Microsoft 
Visual Studio 8.0.61205.56 (2005) as the second record and  Microsoft Visual 
Studio 2006 as first record. I wanted to have Microsoft Visual Studio 
8.0.61205.56 (2005) listed first since the user has searched for Microsoft 
Visual Studio 2005. Can you please help?.

We are using NGram so it takes care of misspelled or jumbled words [it
works as expected], e.g.
searching Micrs Visual Studio will get Microsoft Visual Studio
searching Visual Microsoft Studio will get Microsoft Visual Studio

<fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
</fieldType>



Corporate Executive Board India Private Limited. Registration No: 
U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building 
No.10 DLF Cyber City, Gurgaon, 

Re: Text search NGram

2016-03-07 Thread Emir Arnautovic

Hi Rajesh,
Solution includes 2 fields - one "ngram" field (like your txt_token) and 
other "nonngram" field - just tokenized (like your txt_token without 
ngram token filter). If you have two documents:

1. ABCDEF
2. ABCD
And you are searching for ABCD, if you use only ngram field, both are 
matches and doc 1 can be first, but if you search from ngram:ABCD OR 
nonngram:ABCD, doc 2 will have higher score.
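As a schema sketch of that two-field setup (field and type names here are illustrative, not from the original schema; txt_token is the ngram type from this thread, text_general a standard plain-tokenized type):

```xml
<!-- ngram-analyzed copy: loose/partial matching -->
<field name="title_ngram"    type="txt_token"    indexed="true" stored="false"/>
<!-- plain-tokenized copy: rewards exact token matches -->
<field name="title_nonngram" type="text_general" indexed="true" stored="false"/>
<copyField source="title" dest="title_ngram"/>
<copyField source="title" dest="title_nonngram"/>
```

A query would then OR the two clauses and boost the plain-tokenized one, e.g. `title_ngram:(visual studio 2005) OR title_nonngram:(visual studio 2005)^10`, so exact token matches outrank pure ngram overlap.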


Regards,
Emir

On 07.03.2016 15:20, G, Rajesh wrote:

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by "e.g.
boost if matching tokenized fields to make sure exact matches are ordered first"?




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing the content. Anyway, it is not common to use just ngrams for matching
content - in such a case you can expect more unexpected ordering/results. You
should combine ngram fields with normally tokenized fields (e.g. boost matches
on the tokenized fields to make sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the below type and we have indexed the values "title": "Microsoft Visual Studio 2006" and
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual
Studio 2006 as the first record. I wanted to have Microsoft Visual Studio
8.0.61205.56 (2005) listed first since the user searched for Microsoft
Visual Studio 2005. Can you please help?

We are using NGram so it takes care of misspelled or jumbled words [it
works as expected], e.g.
searching Micrs Visual Studio will get Microsoft Visual Studio
searching Visual Microsoft Studio will get Microsoft Visual Studio

<fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
</fieldType>






--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & 
Elasticsearch Support * http://sematext.com/






RE: Text search NGram

2016-03-07 Thread G, Rajesh
Hi Emir,

I have already applied

 and then I have applied 
. Is 
this what you wanted me to have in my config?

Thanks
Rajesh




-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Monday, March 7, 2016 7:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Text search NGram

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by
"e.g. boost if matching tokenized fields to make sure exact matches are ordered
first"?




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing the content. Anyway, it is not common to use just ngrams for matching
content - in such a case you can expect more unexpected ordering/results. You
should combine ngram fields with normally tokenized fields (e.g. boost matches
on the tokenized fields to make sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:
> Hi Team,
>
> We have the below type and we have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56 (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I wanted to have Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first since the user searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so it takes care of misspelled or jumbled words[it
> works as expected] e.g.
> searching Micrs Visual Studio will gets Microsoft Visual Studio
> searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
> <fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
> </fieldType>
>
>
>
>

RE: Text search NGram

2016-03-07 Thread G, Rajesh
Hi Emir,

Thanks for your email. Can you please help me understand what you mean by
"e.g. boost if matching tokenized fields to make sure exact matches are ordered
first"?




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing the content. Anyway, it is not common to use just ngrams for matching
content - in such a case you can expect more unexpected ordering/results. You
should combine ngram fields with normally tokenized fields (e.g. boost matches
on the tokenized fields to make sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:
> Hi Team,
>
> We have the below type and we have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56 (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I wanted to have Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first since the user searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so it takes care of misspelled or jumbled words[it
> works as expected] e.g.
> searching Micrs Visual Studio will gets Microsoft Visual Studio
> searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
> <fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
> </fieldType>
>
>
>
>
>
>

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & 
Elasticsearch Support * http://sematext.com/



Re: Text search NGram

2016-03-07 Thread Emir Arnautovic

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing the content. Anyway, it is not common to use just ngrams for
matching content - in such a case you can expect more unexpected
ordering/results. You should combine ngram fields with normally tokenized
fields (e.g. boost matches on the tokenized fields to make sure exact
matches are ordered first).
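For reference, a field-level sketch of the omitNorms suggestion (the field name is assumed from this thread, and a reindex is required for the change to take effect):

```xml
<field name="title" type="txt_token" indexed="true" stored="true" omitNorms="true"/>
```

With norms off, shorter fields no longer get a length-normalization boost, which is what pushes "Microsoft Visual Studio 2006" above the longer 2005 title here.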


Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the below type and we have indexed the values "title": "Microsoft Visual Studio 2006" and
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual
Studio 2006 as the first record. I wanted to have Microsoft Visual Studio
8.0.61205.56 (2005) listed first since the user searched for Microsoft
Visual Studio 2005. Can you please help?

We are using NGram so it takes care of misspelled or jumbled words [it works as
expected], e.g.
searching Micrs Visual Studio will get Microsoft Visual Studio
searching Visual Microsoft Studio will get Microsoft Visual Studio

<fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
</fieldType>







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



timeAllowed behavior

2016-03-07 Thread Anatoli Matuskova
Hey there,

I'm a bit lost with timeAllowed lately. I'm not using SolrCloud and have a
monolithic index. I have Solr version 4.5.1 in production. Now I'm
testing Solr 5 and timeAllowed seems to behave differently. In 4.5, when it
was hit, it used to return the partial results it could collect. Now, when
timeAllowed is hit, it does not return any partial documents.

Is that normal?

Thanks in advance.
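For reference, when timeAllowed is exceeded and partial results are collected, Solr flags it in the response header, roughly like this (values illustrative):

```xml
<lst name="responseHeader">
  <int name="status">0</int>
  <!-- QTime value is illustrative -->
  <int name="QTime">1042</int>
  <bool name="partialResults">true</bool>
</lst>
```

Checking whether that flag appears in the 5.x responses is a quick way to tell whether the timeout fired before any hits were collected, or is simply never being hit.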



--
View this message in context: 
http://lucene.472066.n3.nabble.com/timeAllowed-behavior-tp4262110.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Text search NGram

2016-03-07 Thread G, Rajesh
Hi Binoy,

It is Standard Query Parser

Thanks
Rajesh




-Original Message-
From: Binoy Dalal [mailto:binoydala...@gmail.com]
Sent: Monday, March 7, 2016 5:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

What query parser are you using?

Additionally, run the same query with &debugQuery=true and see how your results 
are being scored to find out why the ms vs 2006 shows up before 2005.
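For example (host and core name are hypothetical; parameters shown unencoded):

```
http://localhost:8983/solr/<core>/select?q=title:(Microsoft AND Visual AND Studio AND 2005)&debugQuery=true
```

The `explain` section of the debug output shows the per-document score breakdown, including any norm contribution.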

On Mon, 7 Mar 2016, 16:14 G, Rajesh,  wrote:

> Hi Team,
>
> We have the below type and we have indexed the values "title":
> "Microsoft Visual Studio 2006" and "title": "Microsoft Visual Studio
> 8.0.61205.56 (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I wanted to have
> Microsoft Visual Studio 8.0.61205.56 (2005) listed first since the
> user searched for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so it takes care of misspelled or jumbled words[it
> works as expected] e.g.
> searching Micrs Visual Studio will gets Microsoft Visual Studio
> searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
> <fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
> </fieldType>
>
>
>
>
>
> --
Regards,
Binoy Dalal


Solr crash

2016-03-07 Thread Mugeesh Husain
Hello everyone,
Could you give me some suggestions for handling a Solr crash?

I am writing a script to recover from a Solr crash; which logic should I be implementing?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-crash-tp4262090.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spatial Search on Postal Code

2016-03-07 Thread Manohar Sripada
Thanks Again Emir! I will try this way.

Thanks David! It looks like building of polygons at index time is better
option than at query time.

Thanks,
Manohar

On Sat, Mar 5, 2016 at 7:54 PM, david.w.smi...@gmail.com <
david.w.smi...@gmail.com> wrote:

> Another path to consider is doing this point-in-zipcode-poly lookup at
> index time and enriching the document with a zipcode field (possibly
> multi-valued if there is doubt).
>
> On Sat, Mar 5, 2016 at 4:05 AM steve shepard 
> wrote:
>
> > re: Postal Codes and polygons. I've heard of basic techniques that use
> > Commerce Department data (or was it Census within Commerce?) that give the
> > basic points, but the real rub is deciding what the "center" of that
> > polygon is. There is likely a commercial solution available, and certainly
> > you can buy a spreadsheet with the zipcodes and their guesstimated centers.
> > Fun project!
> >
> > > Subject: Re: Spatial Search on Postal Code
> > > To: solr-user@lucene.apache.org
> > > From: emir.arnauto...@sematext.com
> > > Date: Fri, 4 Mar 2016 21:18:10 +0100
> > >
> > > Hi Manohar,
> > > I don't think there is such functionality in Solr - you need to do it
> on
> > > client side:
> > > 1. find some postal code polygons (you can use open street map -
> > > http://wiki.openstreetmap.org/wiki/Key:postal_code)
> > > 2. create zip to polygon lookup
> > > 3. create code that will expand zip code polygon by some distance (you
> > > can use JTS buffer api)
> > >
> > > On query time you get zip code and distance:
> > > 1. find polygon for zip
> > > 2. expand polygon
> > > 3. send resulting polygon to Solr and use Intersects function to filter
> > > results
> > >
> > > Regards,
> > > Emir
> > >
> > > On 04.03.2016 19:49, Manohar Sripada wrote:
> > > > Thanks Emir,
> > > >
> > > > Obviously #2 approach is much better. I know its not straight
> forward.
> > But,
> > > > is it really acheivable in Solr? Like building a polygon for a postal
> > code.
> > > > If so, can you throw some light how to do?
> > > >
> > > > Thanks,
> > > > Manohar
> > > >
> > > > On Friday, March 4, 2016, Emir Arnautovic <
> > emir.arnauto...@sematext.com>
> > > > wrote:
> > > >
> > > >> Hi Manohar,
> > > >> This depends on your requirements/usecase. If postal code is
> > interpreted
> > > >> as point than it is expected to have radius that is significantly
> > larger
> > > >> than postal code diameter. In such case you can go with first
> > approach. In
> > > >> order to avoid missing results from postal code in case of small
> > search
> > > >> radius and large postal code, you can reverse geocode records and
> > store
> > > >> postal code with each document.
> > > >> If you need to handle distance from postal code precisely - distance
> > from
> > > >> its border, you have to get postal code polygon, expand it by search
> > > >> distance and use resulting polygon to find matches.
> > > >>
> > > >> HTH,
> > > >> Emir
> > > >>
> > > >> On 04.03.2016 13:09, Manohar Sripada wrote:
> > > >>
> > > >>> Here's my requirement -  User enters postal code and provides the
> > radius.
> > > >>> I
> > > >>> need to find the records with in the radius from the provided
> postal
> > code.
> > > >>>
> > > >>> There are few ways I thought through after going through the
> "Spatial
> > > >>> Search" Solr wiki
> > > >>>
> > > >>> 1. As Latitude and Longitude positions are required for spatial
> > search.
> > > >>> Get
> > > >>> Latitude Longitude position (may be using GeoCoding API) of a
> postal
> > code
> > > >>> and use "LatLonType" field type and query accordingly. As the
> > GeoCoding
> > > >>> API
> > > >>> returns one point and if the postal code area is too big, then I
> may
> > end
> > > >>> up
> > > >>> not getting any results (apart from the records from the same
> postal
> > code)
> > > >>> if the radius provided is small.
> > > >>>
> > > >>> 2. Get the latitude longitude points of the postal code which
> forms a
> > > >>> border (not sure yet on how to get) and build a polygon (using
> RPT).
> > While
> > > >>> querying use this polygon and provide the distance. Can this be
> > achieved?
> > > >>> Or Am I ruminating too much? :(
> > > >>>
> > > >>> Appreciate any help on this.
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>>
> > > >> --
> > > >> Monitoring * Alerting * Anomaly Detection * Centralized Log
> Management
> > > >> Solr & Elasticsearch Support * http://sematext.com/
> > > >>
> > > >>
> > >
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> >
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
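The polygon approach described in this thread ends with a filter query along these lines (the field name and coordinates are hypothetical; WKT polygon support on an RPT field requires JTS on the classpath):

```
fq={!field f=location_rpt}Intersects(POLYGON((30 10, 40 40, 20 40, 10 20, 30 10)))
```

The expanded postal-code polygon is serialized to WKT on the client side and passed as the Intersects argument.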


Re: Text search NGram

2016-03-07 Thread Binoy Dalal
What query parser are you using?

Additionally, run the same query with &debugQuery=true and see how your
results are being scored to find out why the ms vs 2006 shows up before
2005.

On Mon, 7 Mar 2016, 16:14 G, Rajesh,  wrote:

> Hi Team,
>
> We have the below type and we have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56
> (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I wanted to have Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first since the user searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so it takes care of misspelled or jumbled words[it
> works as expected]
> e.g.
> searching Micrs Visual Studio will gets Microsoft Visual Studio
> searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
> <fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
>   </analyzer>
> </fieldType>
>
>
>
>
>
> --
Regards,
Binoy Dalal


Re: [Migration Solr4 to Solr5] Collection reload error

2016-03-07 Thread Gerald Reinhart


Hi,

 To give you some context: we are migrating from Solr 4 to Solr 5;
the client code and the configuration haven't changed, but now we are
facing this problem. We have already checked the commit behaviour
configuration and it seems good.

Here it is :

Server side, we have 2 collections (main and temp with blue and green
aliases) :

   solrconfig.xml:

  
  
 (...)
 
   90
   false
 

 

  
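Given the "exceeded limit of maxWarmingSearchers=1" error quoted below, one setting worth checking alongside the commit configuration is the warming-searcher limit in the same solrconfig.xml (a sketch; 2 is the value shipped in the example configs):

```xml
<query>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```

With the limit at 1, a collection reload that races the searcher warm-up triggered by the post-swap commit will fail exactly as in the stack trace.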

Client side, we have 2 different modes:

1 - Full recovery :

- Delete all documents from the temp collection
  solrClient.deleteByQuery("*:*")

- Add all new documents in the temp collection (can be more
than 5 million),
  solrClient.add(doc, -1) // commitWithinMs == -1

-  Commit when all documents are added
  solrClient.commit(false,false) // waitFlush == false ,
waitSearcher == false

-  Swap blue and green using "create alias" command

-  Reload the temp collection to clean the cache. This is
at this point we have the issue.

2 - Incremental :

-  Add or delete documents from the main collection
   solrClient.add(doc, 180)   // commitWithin
== 30 mn
   solrClient.deleteById(doc, 180) // commitWithin == 30 mn

Maybe you will spot something obviously wrong ?

Thanks

Gérald and Elodie



On 03/04/2016 12:41 PM, Dmitry Kan wrote:

Hi,

Check the the autoCommit and autoSoftCommit nodes in the solrconfig.xml.
Set them to reasonable values. The idea is that if you commit too often,
searchers will be warmed up and thrown away. If at any point in time you
get overlapping commits, there will be several searchers sitting on the
deck.

Dmitry

On Mon, Feb 29, 2016 at 4:20 PM, Gerald Reinhart 
wrote:
Hi,

We are facing an issue during a migration from Solr4 to Solr5.

Given
- migration from solr 4.10.4 to 5.4.1
- 2 collections
- cloud with one leader and several replicas
- in solrconfig.xml: maxWarmingSearchers=1
- no code change

When collection reload using /admin/collections using solrj

Then

2016-02-29 13:42:49,011 [http-8080-3] INFO
org.apache.solr.core.CoreContainer:reload:848  - Reloading SolrCore
'fr_blue' using configuration from collection fr_blue
2016-02-29 13:42:45,428 [http-8080-6] INFO
org.apache.solr.search.SolrIndexSearcher::237  - Opening
Searcher@58b65fc[fr_blue] main
(...)
2016-02-29 13:42:49,077 [http-8080-3] WARN
org.apache.solr.core.SolrCore:getSearcher:1762  - [fr_blue] Error
opening new searcher. exceeded limit of maxWarmingSearchers=1, try again
later.
2016-02-29 13:42:49,091 [http-8080-3] ERROR
org.apache.solr.handler.RequestHandlerBase:log:139  -
org.apache.solr.common.SolrException: Error handling 'reload' action
 at

org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:770)
 at

org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:230)
 at

org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:184)
 at

org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
 at

org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
 at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
 at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:223)
 at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
 at

org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at

org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at

org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at

org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at

org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at

org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:555)
 at

org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
 at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
 at

org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
 at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Unable to reload core
[fr_blue]
 at
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:854)
 at

org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler

Text search NGram

2016-03-07 Thread G, Rajesh
Hi Team,

We have the below type and we have indexed the values "title": "Microsoft Visual
Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual
Studio 2006 as the first record. I wanted to have Microsoft Visual Studio
8.0.61205.56 (2005) listed first since the user searched for Microsoft
Visual Studio 2005. Can you please help?

We are using NGram, so it takes care of misspelled or jumbled words [this works
as expected], e.g.:
searching Micrs Visual Studio will get Microsoft Visual Studio
searching Visual Microsoft Studio will get Microsoft Visual Studio
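The ranking confusion above can be seen from the grams themselves: with a small
gram size, "2005" and "2006" share most of their grams, so an ngram-only field
cannot cleanly prefer the exact year. The sketch below emulates ngram
tokenization in plain Python (it is not from the thread, and the gram sizes are
assumptions) just to show the overlap:

```python
def ngrams(term, min_n=2, max_n=3):
    # Emit all substrings of length min_n..max_n, in the spirit of
    # Lucene's NGramTokenFilter (gram sizes here are assumed values).
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(term) - n + 1):
            out.append(term[i:i + n])
    return out

# "2005" and "2006" share the grams "20", "00" and "200", so both
# documents match an ngram query for 2005 with similar scores.
shared = set(ngrams("2005")) & set(ngrams("2006"))
print(sorted(shared))
```

A common remedy (again, a suggestion, not something confirmed in the thread) is
to copy the title into a second, non-ngram field and boost that field higher,
e.g. via the edismax qf parameter, so exact-token matches outrank gram-only
matches.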

[the fieldType definition with the NGram filter was stripped by the list
archive]







Re: High Cpu sys usage

2016-03-07 Thread Toke Eskildsen
On Sun, 2016-03-06 at 08:26 -0700, Shawn Heisey wrote:
> On 3/5/2016 11:44 PM, YouPeng Yang wrote:
> >   We are using Solr Cloud 4.6 in our production for searching service
> > since 2 years ago.And now it has 700GB in one cluster which is  comprised
> > of 3 machines with ssd. At beginning ,everything go well,while more and
> > more business services interfered with our searching service .And a problem
> >  which we haunted with is just like a  nightmare . That is the cpu sys
> > usage is often growing up to  over 10% even higher, and as a result the
> > machine will hang down because system resources have be drained out.We have
> > to restart the machine manually.

> One of the most common reasons for performance issues with Solr is not
> having enough system memory to effectively cache the index. [...]

How does this relate to YouPeng reporting that the CPU usage increases?

This is not a snark. YouPeng mentions kernel issues. It might very well
be that IO is the real problem, but that it manifests in a non-intuitive
way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
system is struggling, even if IO-Wait is low?

YouPeng: If you are on a *nix-system, can you call 'top' on a machine
and copy-paste the output somewhere we can see?


- Toke Eskildsen, State and University Library, Denmark