Re: Show all fields in Solr highlighting output
Hi Edwin,

I think the highlighting behaviour of those types shifts over time. Maybe we should do the reverse and move the snippets into the main response: https://issues.apache.org/jira/browse/SOLR-3479

Ahmet

On Thursday, June 11, 2015 11:23 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not able to be highlighted by default? Regards, Edwin

On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Edwin, hl.alternateField is probably what you are looking for. ahmet

On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields where highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
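[Editor's note] A sketch of the hl.alternateField suggestion above, set in a handler's defaults; the handler name and the per-field override for a hypothetical price field are assumptions, not from the thread:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">*</str>
    <!-- when a field yields no snippet, fall back to its raw stored value -->
    <str name="f.price.hl.alternateField">price</str>
  </lst>
</requestHandler>
```

With a per-field hl.alternateField like this, a field that cannot produce a snippet (e.g. float, date, int) should still appear in the highlighting section with its stored value.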
Re: Problem with german hyphenated words not being found
The next thing to do is add debugQuery=true to your URL (or enable it in the query pane of the admin UI). Then look for the parsed query info. On the standard text_en field, which includes an English stop word filter, I ran a query on Jack and Jill's House, which showed this output:

rawquerystring: text_en:(Jack and Jill's House)
querystring: text_en:(Jack and Jill's House)
parsedquery: text_en:jack text_en:jill text_en:hous
parsedquery_toString: text_en:jack text_en:jill text_en:hous

You can see that the parsed query is formed *after* analysis, so you can see exactly what is being queried for. Also, as a corollary to this, you can use the schema browser (or faceting, for that matter) to view what terms are being indexed, to see if they should match.

HTH
Upayavira

Am 11.06.2015 12:00 schrieb Upayavira:

Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types. Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries. Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:

Hey, in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2].
You can see that both word parts are found. However, querying for industrie-anhänger does not yield results; only when the hyphen is removed do I get results, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.

Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated.

Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
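[Editor's note] The debugQuery check Upayavira describes earlier in the thread is just an extra request parameter; a sketch of such a request, with host and core name assumed (the ä would be percent-encoded on the wire):

```
http://localhost:8983/solr/collection1/select?q=name:industrie-anhänger&debugQuery=true
```

The parsedquery entries in the debug section of the response then show the query as it looks after analysis.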
Re: DocTransformers for restructuring output, e.g. Highlighting
Yes! It only needs to be done!

On Thu, Jun 11, 2015, at 11:38 AM, Ahmet Arslan wrote:

Hi Upayavira, I was going to suggest SOLR-3479 to Edwin when I saw your old post. Regarding your suggestion, there is an existing ticket: https://issues.apache.org/jira/browse/SOLR-3479 I think SOLR-7665 is also relevant to your question. Ahmet

On Sunday, June 23, 2013 9:54 PM, Upayavira u...@odoko.co.uk wrote:

I've just taken a peek at the src for DocTransformers. They get given a TransformContext. That context contains the query and a few other bits and pieces. If it contained the response, DocTransformers would be able to do output restructuring. The best example is hit highlighting. If you did:

hl=on&hl.fl=name&fl=*,[highlight:name]

you would no longer need to seek the highlighted strings in another part of the output. The conceptual downside of this approach is that we might expect the highlighting to be done inside the DocTransformer, not a search component, i.e. not needing the hl=on&hl.fl=name bit. That is, this would be a great change for existing Solr users, but might be confusing for new Solr users. I did try to move the highlighting code itself into the DocTransformer, but stalled at the point at which it needed to be CoreAware, as DocTransformers aren't allowed to be. Without that, it isn't possible to access the Highlighter components in the core's configuration. Thoughts? Is this a useful feature? Upayavira
Re: Problem with german hyphenated words not being found
Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types. Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries. Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:

Hey, in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found. However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.
Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated.
Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
Re: DocTransformers for restructuring output, e.g. Highlighting
Hi Upayavira, I was going to suggest SOLR-3479 to Edwin when I saw your old post. Regarding your suggestion, there is an existing ticket: https://issues.apache.org/jira/browse/SOLR-3479 I think SOLR-7665 is also relevant to your question. Ahmet

On Sunday, June 23, 2013 9:54 PM, Upayavira u...@odoko.co.uk wrote:

I've just taken a peek at the src for DocTransformers. They get given a TransformContext. That context contains the query and a few other bits and pieces. If it contained the response, DocTransformers would be able to do output restructuring. The best example is hit highlighting. If you did:

hl=on&hl.fl=name&fl=*,[highlight:name]

you would no longer need to seek the highlighted strings in another part of the output. The conceptual downside of this approach is that we might expect the highlighting to be done inside the DocTransformer, not a search component, i.e. not needing the hl=on&hl.fl=name bit. That is, this would be a great change for existing Solr users, but might be confusing for new Solr users. I did try to move the highlighting code itself into the DocTransformer, but stalled at the point at which it needed to be CoreAware, as DocTransformers aren't allowed to be. Without that, it isn't possible to access the Highlighter components in the core's configuration. Thoughts? Is this a useful feature? Upayavira
Problem with german hyphenated words not being found
Hey, in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found. However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.
Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated.
Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
Re: Indexing issue - index get deleted
Hi Chris,

Amazing analysis! I actually did not investigate the log at first, because I was trying to get more information from the user. We are running full-import and delta-import crons:

full index: once a day
delta index: every 10 mins

"last night my index automatically deleted (numdocs=0). attaching logs for review."

Reading the user's initial mail more carefully, he does a full import as well (and at that point, cleaning the index). Not sure there is any practical reason to do that; the user will clarify that to us. So after the clean happened, something prevented the full import from proceeding, and we got the weird behaviour seen in the logs. Really curious to understand this better :)

2015-06-11 1:36 GMT+01:00 Chris Hostetter hossman_luc...@fucit.org:

: The guy was using delta import anyway, so maybe the problem is
: different and not related to the clean.

That's not what the logs say. Here's what I see...

The log begins with server startup @ Jun 10, 2015 11:14:56 AM. The DeletionPolicy for the shopclue_prod core is initialized at Jun 10, 2015 11:15:04 AM, and we see a few interesting things here that we note for the future as we keep reading:

1) There is currently commits:num=1 commits on disk
2) the current index dir in use is index.20150311161021822
3) the current segment generation is segFN=segments_1a,generation=46

Immediately after this, we see some searcher warming using a searcher with this same segments file, and then this searcher is registered (Jun 10, 2015 11:15:05 AM) and the core is registered. Next we see some replication polling, and what look like some simple monitoring requests for q=* which return hits=85898, repeated over and over. At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that look like they are coming from the UI, and then at Jun 10, 2015 11:17:01 AM we see a request for a full import started.
We have no idea what the data import configuration file looks like, so we have no idea if clean=false is being used or not. It's certainly not specified in the URL. We see some more monitoring URLs returning hits=85898 and some more /replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the first commit executed since the server started up. There's no indication that this commit came from an external request (e.g. /update), so it was probably made by some internal request. One possibility is that it came from DIH finishing -- but I doubt it; I'm fairly sure that would have involved more logging than this. A more probable scenario is that it came from an autoCommit setting -- the fact that it is almost exactly 60 seconds after DIH started, and almost exactly 60 seconds after DIH may have done a deleteAll query due to clean=true, makes it seem very likely that this was a 1-minute autoCommit. (But since we have neither the data import config nor the solrconfig.xml, we have no way of knowing -- it's all just guesswork.)

Very importantly, note that this commit is not opening a new searcher:

Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Here are some other interesting things to note from the logging that comes from the DeletionPolicy when this commit happens:

1) it now notes that there are commits:num=2 on disk
2) the current index dir hasn't changed (index.20150311161021822), so some weird replication command didn't swap the world out from under us
3) the newest segment/generation is segFN=segments_1b,generation=47
4) the newest commit has no other files in it besides the segments file. This means, without a doubt, that there are no documents in this commit's view of the index; they have all been deleted by something.
At this point the *old* searcher (for commit generation 46) is still in use, however -- nothing has done an openSearcher=true. We see more /dataimport status requests, and other requests that appear to come from the Solr UI, and more monitoring queries that still return hits=85898 because the same searcher is in use. At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no indication that this came from an outside /update request, so it might be from DIH, or it might be from an autoCommit setting. The fact that it is nearly exactly 10 minutes after DIH started (and probably did a clean=true deleteAll query) makes it seem extremely likely that this is an autoSoftCommit setting kicking in.

Very importantly, note that this soft commit *does* open a new searcher:

Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}

In less than a second, this new searcher is warmed up, and the next time we see a q=* monitoring query get
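[Editor's note] The autoCommit / autoSoftCommit guesswork above corresponds to solrconfig.xml settings along these lines; the 60-second and 10-minute values are inferred from the log timestamps in the analysis, not confirmed from the user's config:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 60s; openSearcher=false matches the 11:18:02 log line -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit every 10 min; opens a new searcher, matching the 11:27:04 log line -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With settings like these, a clean=true full import that stalls after its initial deleteAll leaves an empty index visible as soon as the first searcher-opening commit fires.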
Re: Adding applicative cache to SolrSearcher
Works great, thanks guys! Missed the leafReader because I looked at IndexSearcher instead of SolrIndexSearcher... -- View this message in context: http://lucene.472066.n3.nabble.com/Adding-applicative-cache-to-SolrSearcher-tp4211012p4211183.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
On 11/06/2015 14:57, Paden wrote:

So you're saying that Tika can parse the text OUTSIDE of Solr. So I would still be able to process my PDFs with Tika, just outside of Solr specifically, correct?

-- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211172.html Sent from the Solr - User mailing list archive at Nabble.com.

Yes.

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Re: Merging Sets of Data from Two Different Sources
You were very VERY helpful. Thank you very much. If I could bug you for one last question: do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com.
DocValues memory consumption thoughts
I am using DocValues and I am wondering how to configure the Solr process's Java heap size:

Do DocValues use the system cache (off-heap memory) or heap memory? Should I take DocValues into consideration when I calculate heap parameters (xmx, xmn, xms...)?

-- View this message in context: http://lucene.472066.n3.nabble.com/DocValues-memory-consumption-thoughts-tp4211187.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
I agree with all the ideas explained so far, but I would actually have suggested the DIH (Data Import Handler) as a first plan. It already allows indexing from different datasources out of the box. It supports JDBC datasources with extensive processors, and it also supports a filesystem datasource with the possibility of using the TikaEntityProcessor. So the user's requirement can be met directly with a single DIH configuration and a proper schema design. Of course, if the situation gets more complicated, it will be necessary to customise some DIH component or proceed to write a custom indexer.

Cheers

2015-06-11 16:20 GMT+01:00 Erick Erickson erickerick...@gmail.com:

Here's a skeleton that uses Tika from a SolrJ client. It mixes in a database too, but the parts are pretty separate. https://lucidworks.com/blog/indexing-with-solrj/ Best, Erick

On Thu, Jun 11, 2015 at 7:14 AM, Paden rumsey...@gmail.com wrote:

You were very VERY helpful. Thank you very much. If I could bug you for one last question: do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com.

--
Benedetti Alessandro
Visiting card: http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
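[Editor's note] A minimal data-config.xml along the lines Alessandro describes, combining a JdbcDataSource with the TikaEntityProcessor; the driver, connection details, query, and field names are all assumptions for illustration, not from the thread:

```xml
<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/docs" user="solr" password="secret"/>
  <dataSource name="files" type="BinFileDataSource"/>
  <document>
    <!-- outer entity: one row per document's metadata -->
    <entity name="meta" dataSource="db" query="SELECT id, title, filepath FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- inner entity: pull the PDF body for each row via Tika, joined on the filepath column -->
      <entity name="pdf" dataSource="files" processor="TikaEntityProcessor"
              url="${meta.filepath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested-entity join on filepath is what merges the two sources into a single Solr document per row.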
Exact phrase search not working
This is my field definition:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" collection="default-collection" includeTokens="true" replaceWhitespaceWith="_"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="false" longestOnly="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" collection="default-collection" includeTokens="true" replaceWhitespaceWith="_"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="false" longestOnly="false"/>
  </analyzer>
</fieldType>

Then I query for this exact phrase (which I can see in various documents) and get no results:

my_field:"baltimore police force"

This is the output of the debugQuery part of the result set:

rawquerystring: "baltimore police force"
querystring: "baltimore police force"
parsedquery: PhraseQuery(search_text:"baltimore ? police ? ? force")
parsedquery_toString: search_text:"baltimore ? police ? ? force"
QParser: LuceneQParser

Thanks, Mike
RE: The best way to exclude seen results from search queries
Thanks a lot Charles, this seems to be what I'm looking for. Do you know if a join over this number of documents per user will still give good query performance? Also, are there any limitations on the Solr architecture once the join method is used (i.e. sharding)?

Many thanks, Ami

-- View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022p4211223.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: The best way to exclude seen results from search queries
So long as the fields are indexed, I think performance should be OK. Personally, I would also look at using a single document per user with a multi-valued field for recommendation ID. Assuming only a small fraction of all recommendation IDs are ever presented to any single user, this schema would be physically much smaller and require only a single document per user.

I don't know the answer to your sharding question. The join query is available out of the box, so it should be quick work to set up a two-shard sample and test the distributed sub-query. That said, with the scales you are talking about, I question whether sharding is necessary. You can still use replication for load balancing without sharding.

-Original Message- From: amid [mailto:a...@donanza.com] Sent: Thursday, June 11, 2015 12:36 PM To: solr-user@lucene.apache.org Subject: RE: The best way to exclude seen results from search queries

Thanks a lot Charles, this seems to be what I'm looking for. Do you know if a join over this number of documents per user will still give good query performance? Also, are there any limitations on the Solr architecture once the join method is used (i.e. sharding)? Many thanks, Ami

-- View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022p4211223.html Sent from the Solr - User mailing list archive at Nabble.com.

* This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
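[Editor's note] For reference, the out-of-the-box join Charles refers to is the {!join} query parser. A sketch of an exclusion filter for already-seen documents, where the collection name and field names (seen_recommendations, doc_id, user_id) are assumptions for illustration:

```
q=*:*&fq=-{!join from=doc_id to=id fromIndex=seen_recommendations}user_id:12345
```

The negated join filter removes every document whose id appears as a doc_id in that user's seen records. Note that the fromIndex form joins against another core on the same node, which is part of why the sharding question above matters.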
RE: Show all fields in Solr highlighting output
Moving the highlighted snippets to the main response is a bad thing for some applications. E.g. if you do any sorting or searching on the returned fields, you need to use the original values. The same is true if any of the values are used as a key into some other system or table lookup. Specifically, the insertion of markup into the text changes values that affect sorting and matching. Thus the wisdom of the current design, which returns highlighting results separately.

Of course, it is very simple to merge the highlighting results into the returned documents. The highlighting results have been thoughtfully arranged as a lookup table using the unique ID field as the key. In SolrJ, this is a Map. Thus, you can loop over the result documents, look up the highlight results for each document, and overwrite the original value with the highlighted value. Be sure to set your snippet size bigger than the largest value you expect! Anyway, this type of thing is better handled by the application than by Solr, per se.

static int nDocs( QueryResponse response ) {
  int nReturned = 0;
  if ( null != response && null != response.getResults() ) {
    nReturned = response.getResults().size();
  }
  return nReturned;
}

static boolean hasHighlight( QueryResponse response ) {
  boolean hasHL = false;
  if ( null != response && null != response.getHighlighting() ) {
    hasHL = response.getHighlighting().size() > 0;
  }
  return hasHL;
}

protected void mergeHighlightResults( QueryResponse response, String uniqueIdField ) {
  if ( nDocs(response) > 0 && hasHighlight(response) ) {
    for ( SolrDocument result : response.getResults() ) {
      Map<String, List<String>> hlDoc = response.getHighlighting().get( result.getFirstValue(uniqueIdField) );
      if ( null != hlDoc && hlDoc.size() > 0 ) {
        for ( String fieldName : hlDoc.keySet() ) {
          List<String> hlValues = hlDoc.get( fieldName );
          // This is the only tricky bit: this logic may not work all that well for multi-valued fields.
          // You cannot reliably match the altered values to an original value. So, if any HL values
          // are returned, just replace all values with HL values. This will not work 100% of the time.
          int ix = 0;
          for ( String hlVal : hlValues ) {
            if ( 0 == ix++ ) {
              result.setField( fieldName, hlVal );
            } else {
              result.addField( fieldName, hlVal );
            }
          }
        }
      }
    }
  }
}

-Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: Thursday, June 11, 2015 6:43 AM To: solr-user@lucene.apache.org Subject: Re: Show all fields in Solr highlighting output

Hi Edwin, I think the highlighting behaviour of those types shifts over time. Maybe we should do the reverse and move the snippets into the main response: https://issues.apache.org/jira/browse/SOLR-3479 Ahmet

On Thursday, June 11, 2015 11:23 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not able to be highlighted by default? Regards, Edwin

On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Edwin, hl.alternateField is probably what you are looking for. ahmet

On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote:

Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields where highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
How to index/search without whitespace but hightlight with whitespace?
Hey everyone! I'm trying to set up a Solr instance on some free-text clinical data. This data has a lot of whitespace formatting; for example, I might have a document that contains unstructured bulleted lists or section titles:

blah blah blah...

MEDICATIONS:
* Xanax
* Phenobritrol

DIAGNOSIS:

blah blah blah...

When indexing (and thus querying) this document, I use a text field with tokenization, stemming, etc.; let's call it text. Unfortunately, when I try to print highlighted results, the newlines and whitespace are obviously not preserved. In an attempt to get around this, I created a second field in the index that stores the full content of each document as a string, thus preserving the whitespace, called text_raw. If I set up the search page to search on the text field but highlight on the text_raw field, then the highlighted matches don't always line up. Is there a way to somehow project the stemmed matches from the text field onto the text_raw field when displaying highlighting?

Thank you for your time, Travis
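[Editor's note] The two-field setup Travis describes might look like this in the schema; field names follow the message, and the string type for the raw copy is an assumption:

```xml
<field name="text" type="text" indexed="true" stored="true"/>
<!-- unanalyzed copy so the stored value keeps its original whitespace -->
<field name="text_raw" type="string" indexed="false" stored="true"/>
<copyField source="text" dest="text_raw"/>
```

Note that highlighting operates on a field's *stored* value, which already preserves whitespace, so whether a separate raw copy is needed at all depends on how the display layer renders the snippets.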
RE: The best way to exclude seen results from search queries
Thanks Charles, We thought of using a multi-valued field, but got the feeling it will not stay small as our data grows. Another issue with a multi-valued field is that you can't create a complex join query, while using a different collection whose documents have more than one field (e.g. recommendation_date) can help us easily delete or limit how long a recommendation will not be shown again. Thanks for your answer; seems like replication load balancing will be good enough for now :) Thanks a lot, Ami -- View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022p4211239.html Sent from the Solr - User mailing list archive at Nabble.com.
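For reference, the simplest baseline for excluding seen results (and the thing that stops scaling as the seen list grows, which is what the thread is weighing against a separate collection) is a negative filter query built from the seen IDs. A minimal sketch, assuming a field named id:

```python
def exclusion_fq(seen_ids):
    """Build a Solr fq that filters out already-seen document IDs."""
    if not seen_ids:
        return "*:*"  # nothing seen yet: match everything
    # e.g. seen_ids ["12", "34"] -> "-id:(12 OR 34)"
    return "-id:(" + " OR ".join(seen_ids) + ")"
```

The resulting string would be passed as an fq parameter on each search request.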
Lucene/Solr Revolution 2015 Voting
Hey Folks, If you're interested in going to Lucene/Solr Revolution this year in Austin, please vote for the sessions you would like to see! https://lucenerevolution.uservoice.com/ -Yonik
Re: Merging Sets of Data from Two Different Sources
I do have a link between both sets of data, and that would be the filepath, which could be indexed by both. I do, however, have large PDFs that need to be indexed. So just for clarification: could I write an indexer that used both the DIH and SolrCell to submit a combined record to Solr, or would there be a different process if I used these methods instead? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211169.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
On 11/06/2015 14:38, Paden wrote: I do have a link between both sets of data and that would be the filepath that could be indexed by both. Great. I do, however, have large PDF's that do need to be indexed. So just for clarification, I could write an indexer that used both the DIH and SolrCell to submit a combined record to Solr or would there be a different process if I used these methods instead? No, I'm suggesting you write an indexer that doesn't use either DIH or SolrCell. Charlie -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211169.html Sent from the Solr - User mailing list archive at Nabble.com. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Separate network interfaces for inter-node communication and update/search requests?
On 6/11/2015 6:47 AM, MOIS Martin (MORPHO) wrote: is it possible to separate the network interface for inter-node communication from the network interface for update/search requests? If so I could put two network cards in each machine and route the index and search traffic over the first interface and the traffic for the inter-node communication (sending documents to replicas) over the second interface. Assuming you are using SolrCloud, you would do this by using the name or IP address of the internal communication interface on the host parameter in your solr.xml file (or -Dhost=foo on the startup commandline). This will cause each node to register itself with zookeeper using that interface. Note that what I've said above probably will not work with a cloud-aware client like CloudSolrClient/CloudSolrServer in SolrJ, because that client will obtain the server/port for each node from zookeeper and try to contact each one directly. The necessary routing probably will not be in place. If it's not SolrCloud, then the shards parameter that you are using for distributed search would need internal names/addresses. The other interface, for queries and updates, would be the one with the default gateway. Thanks, Shawn
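As a sketch of what Shawn describes (assuming SolrCloud; the address below is a hypothetical internal-interface IP), the node's registered address is set in solr.xml:

```xml
<solr>
  <solrcloud>
    <!-- Register this node in ZooKeeper under the internal interface,
         so inter-node traffic (e.g. documents sent to replicas) uses it. -->
    <str name="host">10.0.1.5</str>
    <int name="hostPort">8983</int>
  </solrcloud>
</solr>
```

The same value can be supplied at startup with -Dhost=10.0.1.5; external search/update clients would keep using the other interface's address, with the caveats about cloud-aware clients noted above.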
Re: Merging Sets of Data from Two Different Sources
On 11/06/2015 14:19, Paden wrote: I'm trying to figure out if Solr is a good fit for my project. I have two sets of data. On the one hand there is a bunch of files sitting in a local file system in a Linux file system. On the other is a set of metadata FOR the files that is located in a MySQL database. I need a program that can merge BOTH sets of data into one index. Meaning that the metadata in the database will attach/merge with the file data(the text) from the file system to create one searchable indexed item for each document in the file system. The metadata located in the database contains information that is vital to a faceted search of the documents located in the file system. Would Solr accomplish my goals? And if so, what tools can it provide to do so? If you can link the files and the metadata easily, then this shouldn't be hard (i.e. you have some common identifier). We would write an indexer in Python that extracted data from MySQL, crawled the filesystem and used Apache Tika to extract plain text from the files, then submitted a combined record to Solr for indexing. You'll need to decide on a schema for the combined record of course. There are alternatives (DataImportHandler for the database, SolrCell for submitting the files directly) but we prefer to keep the file handling in particular outside of Solr (as large PDFs for example can kill Tika and thus Solr itself). Cheers Charlie -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166.html Sent from the Solr - User mailing list archive at Nabble.com. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
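The indexer Charlie describes can be sketched as follows. This is a hedged stand-in, not a complete implementation: a dict plays the part of the MySQL metadata table and another dict stands in for the filesystem crawl plus Tika text extraction; a real version would use a MySQL driver, Apache Tika, and a Solr client. All names and values here are invented.

```python
# Hypothetical metadata rows as they might come back from MySQL,
# keyed by file path (the common identifier between both sources).
metadata_by_path = {
    "/docs/report.pdf": {"author": "J. Smith", "department": "Legal"},
    "/docs/notes.pdf": {"author": "A. Jones"},
}

# Stand-in for crawling the filesystem and running Tika on each file.
extracted_text = {
    "/docs/report.pdf": "Quarterly results were strong.",
    "/docs/notes.pdf": "Meeting notes from Tuesday.",
}

def build_documents(texts, metadata):
    """Merge each file's extracted text with its database metadata
    into one combined record per file, ready for indexing."""
    docs = []
    for path, content in sorted(texts.items()):
        doc = {"id": path, "content": content}
        doc.update(metadata.get(path, {}))  # attach DB fields when present
        docs.append(doc)
    return docs

docs = build_documents(extracted_text, metadata_by_path)
# A real indexer would now POST these documents to Solr's /update handler.
```

Each combined record is a single searchable document, so the database fields (for faceting) and the file text (for full-text search) live side by side.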
Separate network interfaces for inter-node communication and update/search requests?
Hello, is it possible to separate the network interface for inter-node communication from the network interface for update/search requests? If so I could put two network cards in each machine and route the index and search traffic over the first interface and the traffic for the inter-node communication (sending documents to replicas) over the second interface. Best Regards, Martin Mois # This e-mail and any attached documents may contain confidential or proprietary information. If you are not the intended recipient, you are notified that any dissemination, copying of this e-mail and any attachments thereto or use of their contents by any means whatsoever is strictly prohibited. If you have received this e-mail in error, please advise the sender immediately and delete this e-mail and all attached documents from your computer system. #
Merging Sets of Data from Two Different Sources
I'm trying to figure out if Solr is a good fit for my project. I have two sets of data. On the one hand there is a bunch of files sitting in a local file system in a Linux file system. On the other is a set of metadata FOR the files that is located in a MySQL database. I need a program that can merge BOTH sets of data into one index. Meaning that the metadata in the database will attach/merge with the file data(the text) from the file system to create one searchable indexed item for each document in the file system. The metadata located in the database contains information that is vital to a faceted search of the documents located in the file system. Would Solr accomplish my goals? And if so, what tools can it provide to do so? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
So you're saying that Tika can parse the text OUTSIDE of Solr, so I would still be able to process my PDFs with Tika, just outside of Solr specifically, correct? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211172.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with german hyphenated words not being found
Thank you for your input. Here's how the query looks with debugQuery=true: rawquerystring: name:industrie-anhänger, querystring: name:industrie-anhänger, parsedquery: MultiPhraseQuery(name:(industrie-anhang industri) (anhang industrieanhang)), parsedquery_toString: name:(industrie-anhang industri) (anhang industrieanhang), It looks like there are some rules applied, expressed by the parentheses. What's the correct interpretation of that? The default operator is OR, yet this looks like the terms inside the parentheses are grouped using AND. On 11.06.2015 12:40, Upayavira wrote: The next thing to do is add debugQuery=true to your URL (or enable it in the query pane of the admin UI). Then look for the parsed query info. On the standard text_en field, which includes an English stop word filter, I ran a query on Jack and Jill's House which showed this output: rawquerystring: text_en:(Jack and Jill's House), querystring: text_en:(Jack and Jill's House), parsedquery: text_en:jack text_en:jill text_en:hous, parsedquery_toString: text_en:jack text_en:jill text_en:hous, You can see that the parsed query is formed *after* analysis, so you can see exactly what is being queried for. Also, as a corollary to this, you can use the schema browser (or faceting for that matter) to view what terms are being indexed, to see if they should match. HTH Upayavira On 11.06.2015 12:00, Upayavira wrote: Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types. Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries. 
Upayavira On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote: Hey, in German, you can string most nouns together by using hyphens, like this: Industrie = industry Anhänger = trailer Industrie-Anhänger = trailer for industrial use Here [1], you can see me querying Industrieanhänger on the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (i.e. industrie-anhänger), and of course we want them to find what they are looking for. I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found. However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying. 
Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway). Any help is greatly appreciated. 
Links: [1] http://imgur.com/2oEC5vz [2]
Re: Phrase Highlighter + Surround Query Parser
Picking up this thread again... When you said 'stock one', did you mean the built-in surround query parser or a customized one? We already use usePhrasehighlighter=true. On Mon, Aug 4, 2014 at 10:38 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, You are using a customized surround query parser, right? Did you check/try with the stock one? If I recall correctly, usePhrasehighlighter=true was working in the past for surround. Ahmet On Monday, August 4, 2014 8:25 AM, Salman Akram salman.ak...@northbaysolutions.net wrote: Anyone? On Fri, Aug 1, 2014 at 12:31 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: We are having an issue in the phrase highlighter with the Surround Query Parser, e.g. *first thing w/100 you must* brings correct results but also highlights individual words of the phrase - first, thing are highlighted where they come separately as well. Any idea how this can be fixed? -- Regards, Salman Akram -- Regards, Salman Akram -- Regards, Salman Akram
Re: Separate network interfaces for inter-node communication and update/search requests?
Modern network interfaces are pretty capable. I doubt this optimization would yield any performance improvements. I would love to see some test results which prove me wrong. Is performance the primary reason for this, or do you have any other reasons? -Ani On Thu, Jun 11, 2015 at 9:04 AM, Shawn Heisey apa...@elyograg.org wrote: On 6/11/2015 6:47 AM, MOIS Martin (MORPHO) wrote: is it possible to separate the network interface for inter-node communication from the network interface for update/search requests? If so I could put two network cards in each machine and route the index and search traffic over the first interface and the traffic for the inter-node communication (sending documents to replicas) over the second interface. Assuming you are using SolrCloud, you would do this by using the name or IP address of the internal communication interface on the host parameter in your solr.xml file (or -Dhost=foo on the startup commandline). This will cause each node to register itself with zookeeper using that interface. Note that what I've said above probably will not work with a cloud-aware client like CloudSolrClient/CloudSolrServer in SolrJ, because that client will obtain the server/port for each node from zookeeper and try to contact each one directly. The necessary routing probably will not be in place. If it's not SolrCloud, then the shards parameter that you are using for distributed search would need internal names/addresses. The other interface, for queries and updates, would be the one with the default gateway. Thanks, Shawn -- Anirudha P. Jadhav
Re: Merging Sets of Data from Two Different Sources
The filepath is the key in both the filesystem and the database -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211253.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
Both sources, the filesystem and the database, contain the file path for each individual file -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211251.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Merging Sets of Data from Two Different Sources
One question is which source defines the key - do you crawl the files and then look up the file name in the database, or scan the database and there is a field to specify the file name? IOW, given a database key, is there a fixed method to determine the file name path? And vice versa. -- Jack Krupansky On Thu, Jun 11, 2015 at 11:48 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I agree with all the ideas so far explained, but actually I would have suggested the DIH ( Data Import Handler) as a first plan. It does already allow out of the box indexing from different datasources. It supports Jdbc datasources with extensive processors and it does support also a file system datasource with the possibility of using the TikaEntityProcessor. So actually the requirement of the user can be reached directly with a single configuration of the DIH and a proper schema design. Of course if the situation gets more complicated there will be the necessity of customising some DIH component or proceeding writing a custom Indexer. Cheers 2015-06-11 16:20 GMT+01:00 Erick Erickson erickerick...@gmail.com: Here's a skeleton that uses Tika from a SolrJ client. It mixes in a database too, but the parts are pretty separate. https://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Thu, Jun 11, 2015 at 7:14 AM, Paden rumsey...@gmail.com wrote: You were very VERY helpful. Thank you very much. If I could bug you for one last question. Do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Merging Sets of Data from Two Different Sources
So you're saying I could merge both the metadata in the database and the files in the file system into one queryable item in Solr just by customizing the DIH correctly and getting the right schema? (I'm sorry if this sounds like a redundant question, but I've been trying to find an answer for the past couple of days and it seems like people sometimes misunderstand what I'm asking.) -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211248.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Merging Sets of Data from Two Different Sources
Yes. Typically, the content file is used to populate a single field in each document, e.g. content. Typically, this field is the primary target for searches. Sometimes, additional metadata (title, author, etc.) can be extracted from the source files. But the idea remains the same: the two sources (database record + file) are merged into a single searchable document in Solr. If you write your own indexer using SolrJ, you have more control over the loading process and, imo, the approach is clearer. All the pieces come together in one place. But Alessandro says the same result is achievable using DataImportHandler. Probably worth a try before writing code... -----Original Message----- From: Paden [mailto:rumsey...@gmail.com] Sent: Thursday, June 11, 2015 4:14 PM To: solr-user@lucene.apache.org Subject: Re: Merging Sets of Data from Two Different Sources So you're saying I could merge both the metadata in the database and their files in the file system into one query-able item in solr by just customizing the DIH correctly and getting the right schema? (I'm sorry this sounds like a redundant question but I've been trying to find an answer for the past couple of days and it seems like people sometimes misunderstand what I'm asking) -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211248.html Sent from the Solr - User mailing list archive at Nabble.com.
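For reference, the DIH configuration Alessandro and Charles are discussing would look roughly like the following. This is a hedged sketch: the table, column, and connection details are invented, but the structure follows DIH's pattern of a JdbcDataSource entity with a nested TikaEntityProcessor entity reading the file named by each database row:

```xml
<dataConfig>
  <!-- MySQL connection for the metadata (all values hypothetical) -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/meta"
              user="user" password="pass"/>
  <!-- Binary file source that TikaEntityProcessor reads from -->
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="record" dataSource="db"
            query="SELECT filepath, author, department FROM docs">
      <!-- Nested entity: extract text from the file named by the DB row -->
      <entity name="file" dataSource="bin" processor="TikaEntityProcessor"
              url="${record.filepath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Each outer-entity row plus its extracted file text becomes one combined Solr document, which is exactly the merge being asked about.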
Re: Index optimize runs in background.
Why would you care when the forced merge (not an “optimize”) is done? Start it and get back to work. Or even better, never force merge and let the algorithm take care of it. Seriously, I’ve been giving this advice since before Lucene was written, because Ultraseek had the same approach for managing index segments. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 10, 2015, at 10:35 PM, Erick Erickson erickerick...@gmail.com wrote: If I knew, I would fix it ;). The sub-optimizes (i.e. the ones sent out to each replica) should be sent in parallel and then each thread should wait for completion from the replicas. There is no real check for optimize, I believe that the return from the call is considered sufficient. If we can track down if there are conditions under which this is not true we can fix it. But until there's a way to reproduce it, it's pretty much speculation. Best, Erick On Wed, Jun 10, 2015 at 10:14 PM, Modassar Ather modather1...@gmail.com wrote: Hi, There are 5 cores and a separate server for indexing on this solrcloud. Can you please share your suggestions on: How can indexer know that the optimize has completed even if the commit/optimize runs in background without going to the solr servers may be by using any solrj or other API? I tried but could not find any API/handler to check if the optimizations is completed. Kindly share your inputs. Thanks, Modassar On Thu, Jun 4, 2015 at 9:36 PM, Erick Erickson erickerick...@gmail.com wrote: Can't get any failures to happen on my end so I really haven't a clue. Best, Erick On Thu, Jun 4, 2015 at 3:17 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Please provide your inputs on optimize and commit running as background. Your suggestion will be really helpful. Thanks, Modassar On Tue, Jun 2, 2015 at 6:05 PM, Modassar Ather modather1...@gmail.com wrote: Erick! I could not find any underlying setting of 10 minutes. 
It is not only optimize; commit is also behaving in the same fashion and is taking less time than it usually did. As per my observation, both are running in the background. On Fri, May 29, 2015 at 7:21 PM, Erick Erickson erickerick...@gmail.com wrote: I'm not talking about you setting a timeout, but the underlying connection timing out... The "10 minutes then the indexer exits" comment points in that direction. Best, Erick On Thu, May 28, 2015 at 11:43 PM, Modassar Ather modather1...@gmail.com wrote: I have not added any timeout in the indexer except the ZK client timeout, which is 30 seconds. I am simply calling client.close() at the end of indexing. The same code was not running in the background for optimize with solr-4.10.3 and org.apache.solr.client.solrj.impl.CloudSolrServer. On Fri, May 29, 2015 at 11:13 AM, Erick Erickson erickerick...@gmail.com wrote: Are you timing out on the client request? The theory here is that it's still a synchronous call, but you're just timing out at the client level. At that point, the optimize is still running; it's just that the connection has been dropped. Shot in the dark. Erick On Thu, May 28, 2015 at 10:31 PM, Modassar Ather modather1...@gmail.com wrote: I did not notice it before, but a commit which used to take around 2 minutes is now taking around 8 seconds. I think this is also running in the background. On Fri, May 29, 2015 at 10:52 AM, Modassar Ather modather1...@gmail.com wrote: The indexer takes almost 2 hours to optimize. It has a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed, it invokes commit and optimize. I have seen that the optimize goes into the background after 10 minutes and the indexer exits. I am not sure why it hangs on the indexer for these 10 minutes. This behavior I have seen in multiple iterations of indexing the same data. There is nothing significant in the logs which I can share. I can see the following in the log. 
org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} On Wed, May 27, 2015 at 10:59 PM, Erick Erickson erickerick...@gmail.com wrote: All strange of course. What do your Solr logs show when this happens? And how reproducible is this? Best, Erick On Wed, May 27, 2015 at 4:00 AM, Upayavira u...@odoko.co.uk wrote: In this case, optimising makes sense, once the index is generated, you are not updating It. Upayavira On Wed, May 27, 2015, at 06:14 AM, Modassar Ather wrote: Our index has almost 100M documents running on SolrCloud of 5 shards and each shard has an index size of about 170+GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week
Re: Indexing issue - index gets deleted
Thanks for replying. Please find the data-config. On Thu, Jun 11, 2015 at 6:06 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : The guy was using delta import anyway, so maybe the problem is : different and not related to the clean. that's not what the logs say. Here's what I see... Log begins with server startup @ Jun 10, 2015 11:14:56 AM The DeletionPolicy for the shopclue_prod core is initialized at Jun 10, 2015 11:15:04 AM and we see a few interesting things here we note for the future as we keep reading... 1) There are currently commits:num=1 commits on disk 2) the current index dir in use is index.20150311161021822 3) the current segment generation is segFN=segments_1a,generation=46 Immediately after this, we see some searcher warming using a searcher with this same segments file, and then this searcher is registered (Jun 10, 2015 11:15:05 AM) and the core is registered. Next we see some replication polling, and we see what look like some simple monitoring requests for q=* which return hits=85898 being repeated over and over. At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that look like they are coming from the UI, and then at Jun 10, 2015 11:17:01 AM we see a request for a full import started. We have no idea what the data import configuration file looks like, so we have no idea if clean=false is being used or not. It's certainly not specified in the URL. We see some more monitoring URLs returning hits=85898 and some more /replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the first commit executed since the server started up. There's no indication that this commit came from an external request (e.g. /update), so it probably was made by some internal request. One possibility is that it came from DIH finishing -- but I doubt it; I'm fairly sure that would have involved more logging than this. 
A more probable scenario is that it came from an autoCommit setting -- the fact that it is almost exactly 60 seconds after DIH started -- and almost exactly 60 seconds after DIH may have done a deleteAll query due to clean=true -- makes it seem very likely that this was a 1-minute autoCommit. (But since we don't have either the data import config or the solrconfig.xml, we have no way of knowing -- it's all just guesswork.) Very importantly, note that this commit is not opening a new searcher... Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Here are some other interesting things to note from the logging that comes from the DeletionPolicy when this commit happens... 1) it now notes that there are commits:num=2 on disk 2) the current index dir hasn't changed (index.20150311161021822), so some weird replication command didn't swap the world out from under us 3) the newest segment/generation is segFN=segments_1b,generation=47 4) the newest commit has no other files in it besides the segments file. This means, without a doubt, there are no documents in this commit's view of the index. They have all been deleted by something. At this point the *old* searcher (for commit generation 46) is still in use, however -- nothing has done an openSearcher=true. We see more /dataimport status requests, and other requests that appear to come from the Solr UI, and more monitoring queries that still return hits=85898 because the same searcher is in use. At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no indication that this came from an outside /update request, so it might be from DIH, or it might be from an autoCommit setting. 
the fact that it is nearly exactly 10 minutes after DIH started (and probably did a clean=true deleteAll query) makes it seem extremely likely this is an autoSoftCommit setting kicking in. Very importantly, note that this softCommit *does* open a new searcher... Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false} In less than a second, this new searcher is warmed up, and the next time we see a q=* monitoring query get logged, it returns hits=0. Note that at no point in the logs, after the DataImporter is started, do we see it log anything other than that it has initiated the request to MySQL -- we do see some logs starting ~ Jun 10, 2015 11:41:19 AM indicating that someone was using the Web UI to look at the dataimport handler's status report. It would be really nice to know what that person saw at that point -- because my guess is DIH was still running and was stalled waiting for MySQL, and hadn't even started adding docs to Solr (if it had, I'm certain there would have been some log of it). So instead, the combination of a
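The timing Chris infers would correspond to solrconfig.xml settings along these lines (a hypothetical sketch; the values are chosen only to match the 60-second and 10-minute intervals seen in the log):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 60s; does not open a new searcher,
       matching the openSearcher=false commit at 11:18:02 -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit every 10 minutes; opens a new searcher,
       which is the moment hits dropped from 85898 to 0 -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

This illustrates why a clean=true import can make an index "disappear" long before the import finishes: the deleteAll is made durable by the hard commit, then made visible by the soft commit, while DIH is still waiting on the database.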
RE: How to assign shard to specific node?
Thank you for your quick answer. The two parameters createNodeSet and createNodeSet.shuffle seem to solve the problem: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&router.name=implicit&shards=shard1,shard2,shard3&router.field=shard&createNodeSet=node1,node2,node3&createNodeSet.shuffle=false Best Regards, Martin Mois -----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, June 10, 2015 17:45 To: solr-user@lucene.apache.org Subject: Re: How to assign shard to specific node? Take a look at the collections API CREATE command in more detail here: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1 Admittedly this is 5.2, but you didn't mention what version of Solr you're using. In particular the createNodeSet and createNodeSet.shuffle parameters. Best, Erick On Wed, Jun 10, 2015 at 8:31 AM, MOIS Martin (MORPHO) martin.m...@morpho.com wrote: Hello, I have a cluster with 3 nodes (node1, node2 and node3). Now I want to create a new collection with 3 shards using `implicit` routing: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&router.name=implicit&shards=shard1,shard2,shard3&router.field=shard How can I control on which node each shard gets created? The goal is to create shard1 on node1, shard2 on node2, etc. The background is that the actual raw data the index is created for should reside on the same host. That means I have a raw record composed of different data (documents, images, meta-data, etc.) for which I compute a Lucene document that gets indexed. In order to reduce network traffic I want to process the raw record on node1 and insert the resulting Lucene document into shard1 that resides on node1. If shard1 resided on node2, the Lucene document would have to be sent from node1 to node2, which causes a lot of inter-node communication for big record sets. Thanks in advance. 
Best Regards, Martin Mois # This e-mail and any attached documents may contain confidential or proprietary information. If you are not the intended recipient, you are notified that any dissemination, copying of this e-mail and any attachments thereto or use of their contents by any means whatsoever is strictly prohibited. If you have received this e-mail in error, please advise the sender immediately and delete this e-mail and all attached documents from your computer system. #
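When assembling the CREATE call above by hand it is easy to drop a parameter or mangle the query string. A minimal sketch of building the Collections API URL with plain Java; the class name is made up for illustration, and the host and node names are taken from the thread's example, not verified against a real cluster:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CreateCollectionUrl {

    // Joins the parameter map into a Collections API CREATE request,
    // keeping insertion order so the resulting URL is reproducible.
    static String buildCreateUrl(String solrBase, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return solrBase + "/solr/admin/collections?action=CREATE&" + query;
    }

    public static void main(String[] args) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("name", "mycollection");
        p.put("numShards", "3");
        p.put("router.name", "implicit");
        p.put("shards", "shard1,shard2,shard3");
        p.put("router.field", "shard");
        p.put("createNodeSet", "node1,node2,node3");   // restricts placement to these nodes
        p.put("createNodeSet.shuffle", "false");       // keeps shard1 -> node1 ordering
        System.out.println(buildCreateUrl("http://localhost:8983", p));
    }
}
```

Note that in a real deployment the createNodeSet entries are node names as registered in ZooKeeper (typically host:port_solr), not bare hostnames.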
Re: Show all fields in Solr highlighting output
Hi Edwin, hl.alternateField is probably what you are looking for. ahmet On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields for which highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
Re: Index optimize runs in background.
Until somewhere around Lucene 3.5, you needed to optimise, because the merge strategy used wasn't that clever and left lots of deletes in your largest segment. Around that point, the TieredMergePolicy became the default. Because its algorithm is much more sophisticated, it took away the need to optimize in the majority of scenarios. In fact, it transformed optimizing from being a necessary thing to being a bad thing in most cases. So yes, let the algorithm take care of it, so long as you are using the TieredMergePolicy, which has been the default for over 2 years. Upayavira On Thu, Jun 11, 2015, at 07:01 AM, Walter Underwood wrote: Why would you care when the forced merge (not an “optimize”) is done? Start it and get back to work. Or even better, never force merge and let the algorithm take care of it. Seriously, I’ve been giving this advice since before Lucene was written, because Ultraseek had the same approach for managing index segments. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Jun 10, 2015, at 10:35 PM, Erick Erickson erickerick...@gmail.com wrote: If I knew, I would fix it ;). The sub-optimizes (i.e. the ones sent out to each replica) should be sent in parallel and then each thread should wait for completion from the replicas. There is no real check for optimize, I believe that the return from the call is considered sufficient. If we can track down if there are conditions under which this is not true we can fix it. But until there's a way to reproduce it, it's pretty much speculation. Best, Erick On Wed, Jun 10, 2015 at 10:14 PM, Modassar Ather modather1...@gmail.com wrote: Hi, There are 5 cores and a separate server for indexing on this solrcloud. Can you please share your suggestions on: How can indexer know that the optimize has completed even if the commit/optimize runs in background without going to the solr servers may be by using any solrj or other API? 
I tried but could not find any API/handler to check if the optimization is completed. Kindly share your inputs. Thanks, Modassar On Thu, Jun 4, 2015 at 9:36 PM, Erick Erickson erickerick...@gmail.com wrote: Can't get any failures to happen on my end so I really haven't a clue. Best, Erick On Thu, Jun 4, 2015 at 3:17 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Please provide your inputs on optimize and commit running in the background. Your suggestion will be really helpful. Thanks, Modassar On Tue, Jun 2, 2015 at 6:05 PM, Modassar Ather modather1...@gmail.com wrote: Erick! I could not find any underlying setting of 10 minutes. It is not only optimize; commit is also behaving in the same fashion and is taking less time than it usually had. As per my observation both are running in the background. On Fri, May 29, 2015 at 7:21 PM, Erick Erickson erickerick...@gmail.com wrote: I'm not talking about you setting a timeout, but the underlying connection timing out... The "10 minutes then the indexer exits" comment points in that direction. Best, Erick On Thu, May 28, 2015 at 11:43 PM, Modassar Ather modather1...@gmail.com wrote: I have not added any timeout in the indexer except the zk client timeout, which is 30 seconds. I am simply calling client.close() at the end of indexing. The same code did not run optimize in the background with solr-4.10.3 and org.apache.solr.client.solrj.impl.CloudSolrServer. On Fri, May 29, 2015 at 11:13 AM, Erick Erickson erickerick...@gmail.com wrote: Are you timing out on the client request? The theory here is that it's still a synchronous call, but you're just timing out at the client level. At that point, the optimize is still running; it's just that the connection has been dropped. Shot in the dark. Erick On Thu, May 28, 2015 at 10:31 PM, Modassar Ather modather1...@gmail.com wrote: I could not notice it, but from my past experience a commit which used to take around 2 minutes is now taking around 8 seconds.
I think this is also running in the background. On Fri, May 29, 2015 at 10:52 AM, Modassar Ather modather1...@gmail.com wrote: The indexer takes almost 2 hours to optimize. It does a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed it invokes commit and optimize. I have seen that the optimize goes into the background after 10 minutes and the indexer exits. I am not sure why it hangs on the indexer for these 10 minutes. I have seen this behavior in multiple iterations of indexing the same data. There is nothing significant I found in the log which I can share. I can see the following in the log: org.apache.solr.update.DirectUpdateHandler2; start
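One way to test Erick's client-timeout theory is to issue the optimize as a plain HTTP request with an explicit, generous read timeout, so the call stays synchronous for the whole merge. A hedged sketch using only the JDK; the class name, host, collection, and timeout are assumptions, while optimize and waitSearcher are standard update-handler parameters:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class OptimizeAndWait {

    // Update-handler URL that triggers a forced merge; waitSearcher=true keeps
    // the request open until the new searcher is available.
    static String optimizeUrl(String solrBase, String collection) {
        return solrBase + "/solr/" + collection + "/update?optimize=true&waitSearcher=true&wt=json";
    }

    // Issues the request with a read timeout long enough for the whole merge,
    // so the HTTP layer does not silently drop a still-running optimize.
    static int runOptimize(String solrBase, String collection, int readTimeoutMs) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(optimizeUrl(solrBase, collection)).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(readTimeoutMs); // e.g. several hours for a 2-hour optimize
        return conn.getResponseCode();      // blocks until Solr answers
    }

    public static void main(String[] args) {
        System.out.println(optimizeUrl("http://localhost:8983", "collection1"));
    }
}
```

If this call returns only after the full 2 hours, the earlier 10-minute exit was almost certainly a connection timeout in the SolrJ client stack rather than Solr finishing early.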
Increase the suggester len size
Hi, I'm facing some issues with my suggester for the content field. As my content is indexed from rich-text documents and is quite large, I got the following error when I tried to build the suggester using /suggesthandler?suggest.build=true: <lst name="error"><str name="msg">len must be <= 32767; got 35578</str></lst> Is there any way to increase the len limit beyond 32767? I might have documents that are even bigger in the future. Regards, Edwin
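The 32767 figure is Lucene's hard cap on the byte length of a single indexed term, so it cannot simply be raised in configuration. One common workaround (an assumption on my part, not from this thread) is to feed the suggester a truncated copy of the field. A minimal sketch of UTF-8-safe truncation; the class and constant names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class SuggestInputTruncate {

    static final int MAX_TERM_BYTES = 32766; // stay under Lucene's per-term byte limit

    // Trims text so its UTF-8 encoding fits within maxBytes, backing off so a
    // multi-byte character is never cut in half.
    static String truncateUtf8(String text, int maxBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return text;
        }
        int end = maxBytes;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--; // 0b10xxxxxx bytes are UTF-8 continuation bytes; back off past them
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(truncateUtf8("abcdef", 3)); // prints "abc"
    }
}
```

The truncated copy would be written to a separate field at indexing time (for example in the client, or in an update processor), and the suggester pointed at that field instead of the full content.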
Re: Show all fields in Solr highlighting output
Thank you for the info, will try to implement it. Regards, Edwin On 12 June 2015 at 01:32, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Moving the highlighted snippets into the main response is a bad thing for some applications. E.g. if you do any sorting or searching on the returned fields, you need to use the original values. The same is true if any of the values are used as a key into some other system or table lookup. Specifically, the insertion of markup into the text changes values that affect sorting and matching. Thus the wisdom of the current design, which returns highlighting results separately. Of course, it is very simple to merge the highlighting results into the returned documents. The highlighting results have been thoughtfully arranged as a lookup table using the unique ID field as the key. In SolrJ, this is a Map<String, Map<String, List<String>>>. Thus, you can loop over the result documents, look up the highlight results for each document and overwrite the original value with the highlighted value. Be sure to set your snippet size bigger than the largest value you expect! Anyway, this type of thing is better handled by the application than by Solr, per se.
static int nDocs( QueryResponse response ) {
    int nReturned = 0;
    if ( null != response && null != response.getResults() ) {
        nReturned = response.getResults().size();
    }
    return nReturned;
}

static boolean hasHighlight( QueryResponse response ) {
    boolean hasHL = false;
    if ( null != response && null != response.getHighlighting() ) {
        hasHL = response.getHighlighting().size() > 0;
    }
    return hasHL;
}

protected void mergeHighlightResults( QueryResponse response, String uniqueIdField ) {
    if ( nDocs(response) > 0 && hasHighlight(response) ) {
        for ( SolrDocument result : response.getResults() ) {
            Map<String, List<String>> hlDoc = response.getHighlighting().get( result.getFirstValue(uniqueIdField) );
            if ( null != hlDoc && hlDoc.size() > 0 ) {
                for ( String fieldName : hlDoc.keySet() ) {
                    List<String> hlValues = hlDoc.get( fieldName );
                    // This is the only tricky bit: this logic may not work all that well for multi-valued fields.
                    // You cannot reliably match the altered values to an original value. So, if any HL values
                    // are returned, just replace all values with HL values.
                    // This will not work 100% of the time.
                    int ix = 0;
                    for ( String hlVal : hlValues ) {
                        if ( 0 == ix++ ) {
                            result.setField( fieldName, hlVal );
                        } else {
                            result.addField( fieldName, hlVal );
                        }
                    }
                }
            }
        }
    }
}

-----Original Message----- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: Thursday, June 11, 2015 6:43 AM To: solr-user@lucene.apache.org Subject: Re: Show all fields in Solr highlighting output Hi Edwin, I think the highlighting behaviour of those types shifts over time. Maybe we should do the reverse and move snippets into the main response: https://issues.apache.org/jira/browse/SOLR-3479 Ahmet On Thursday, June 11, 2015 11:23 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not highlightable by default?
Regards, Edwin On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Edwin, hl.alternateField is probably what you are looking for. ahmet On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields for which highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
Re: Show all fields in Solr highlighting output
Hi Ahmet, I've tried that, but it's still not able to show. Those fields are actually of type=float, type=date and type=int. Are those field types not highlightable by default? Regards, Edwin On 11 June 2015 at 15:03, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Edwin, hl.alternateField is probably what you are looking for. ahmet On Thursday, June 11, 2015 5:38 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Is it possible to list all the fields in the highlighting portion of the output? Currently, even when I set <str name="hl.fl">*</str>, it only shows fields where highlighting is possible; fields for which highlighting is not possible are not shown. I would like the output to show all the fields together, regardless of whether highlighting is possible or not. Regards, Edwin
Re: DocValues memory consumption thoughts
DocValues actually is an un-inverted structure that is built as part of the segment. This means that it has the same behaviour as the other segment files. Assuming you are not using a compound segment file but a classic multi-file segment in an NRTCachingDirectory, the segment is built in memory, and when it reaches the ramBufferSizeMB limit or a hard commit it is flushed to disk. This means that, in my opinion, there is no particular memory degradation to observe when using DocValues. I would actually say that using DocValues instead of the old FieldCache decreases memory consumption, as FieldCaches live completely in memory (with the expensive un-inverting process). From the Solr wiki: In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster. I would size the memory according to the other features you will use! Let me know if I satisfied your curiosity! Cheers 2015-06-11 15:38 GMT+01:00 adfel70 adfe...@gmail.com: I am using DocValues and I am wondering how to configure the Solr process's Java heap size: does DocValues use the system cache (off-heap memory) or heap memory? Should I take DocValues into consideration when I calculate heap parameters (xmx, xmn, xms...)? -- View this message in context: http://lucene.472066.n3.nabble.com/DocValues-memory-consumption-thoughts-tp4211187.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
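As a concrete illustration of the column-oriented approach described above, docValues is enabled per field in the schema. A hypothetical schema.xml fragment; the field names and types are examples, not from the thread:

```xml
<!-- Enabling docValues builds a column-oriented, on-disk structure at index
     time (read via the OS page cache), instead of un-inverting the field into
     the heap-resident FieldCache at search time. -->
<field name="price" type="float" indexed="true" stored="true" docValues="true"/>
<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true"/>
```

This is also the practical answer to the heap-sizing question: docValues data is paged in by the operating system, so it benefits from free RAM outside the JVM heap rather than requiring a larger -Xmx.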
Re: Merging Sets of Data from Two Different Sources
Here's a skeleton that uses Tika from a SolrJ client. It mixes in a database too, but the parts are pretty separate. https://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Thu, Jun 11, 2015 at 7:14 AM, Paden rumsey...@gmail.com wrote: You were very VERY helpful. Thank you very much. If I could bug you for one last question. Do you know where the documentation is that would help me write my own indexer? -- View this message in context: http://lucene.472066.n3.nabble.com/Merging-Sets-of-Data-from-Two-Different-Sources-tp4211166p4211180.html Sent from the Solr - User mailing list archive at Nabble.com.