Re: charfilter doesn't do anything

2013-09-08 Thread Andreas Owen
Yes, but that filters HTML and not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

 Hmmm, have you looked at:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Not quite the body, perhaps, but might it help?
 
 
 On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:
 
  ok, I have HTML pages like <html>...<!--body-->content I
  want<!--/body-->...</html>. I want to extract (index, store) only
  what is between the body comments. I thought RegexTransformer would be the
  best choice because XPath doesn't work in Tika and I can't nest an
  XPathEntityProcessor to use XPath. What I have also found out is that the
  HTML parser from Tika cuts my body comments out and tries to make well-formed
  HTML, which I would like to switch off.
 
 On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
 
 On 9/6/2013 7:09 AM, Andreas Owen wrote:
  I've managed to get it working if I use the RegexTransformer and the string
  is on the same line in my Tika entity. But when the string spans multiple lines
  it isn't working, even though I tried (?s) to set the DOTALL flag.
 
  <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
          dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
          transformer="RegexTransformer">
    <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
           replaceWith="QQQ" sourceColName="text" />
  </entity>
 
  Then I tried it like this and I get a StackOverflowError:
 
  <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
         replaceWith="QQQ" sourceColName="text" />
 
  In JavaScript this works, but maybe that's because I only used a small string.
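  
  One variant I have not tried yet (just a sketch) is the inline DOTALL flag with
  a reluctant quantifier; it should avoid the per-character alternation that can
  overflow Java's regex stack on large inputs:
  
  <field column="text_html" regex="(?s)&lt;body&gt;(.+?)&lt;/body&gt;"
         replaceWith="QQQ" sourceColName="text" />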
 
 Sounds like we've got an XY problem here.
 
 http://people.apache.org/~hossman/#xyproblem
 
 How about you tell us *exactly* what you'd actually like to have happen
 and then we can find a solution for you?
 
 It sounds a little bit like you're interested in stripping all the HTML
 tags out.  Perhaps the HTMLStripCharFilter?
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Something that I already said: By using the KeywordTokenizer, you won't
 be able to search for individual words on your HTML input.  The entire
 input string is treated as a single token, and therefore ONLY exact
 entire-field matches (or certain wildcard matches) will be possible.
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
 
 Note that no matter what you do to your data with the analysis chain,
 Solr will always return the text that was originally indexed in search
 results.  If you need to affect what gets stored as well, perhaps you
 need an Update Processor.
 
 Thanks,
 Shawn
 
 



Re: subindex

2013-09-08 Thread Peyman Faratin
Hi Erick

it makes sense. Thank you for this. 

peyman

On Sep 5, 2013, at 4:11 PM, Erick Erickson erickerick...@gmail.com wrote:

 Nope. You can do this if you've stored _all_ the fields (with the exception
 of
 _version_ and the destinations of copyField directives). But there's no way
 I
 know of to do what you want if you haven't.
 
 If you have, you'd be essentially spinning through all your docs and
 re-indexing
 just the fields you cared about. But if you still have access to your
 original
 docs this would be slower/more complicated than just re-indexing from
 scratch.
 
 Best
 Erick
 
 
 On Wed, Sep 4, 2013 at 1:51 PM, Peyman Faratin pey...@robustlinks.com wrote:
 
 Hi
 
 Is there a way to build a new (smaller) index from an existing (larger)
 index where the smaller index contains a subset of the fields of the larger
 index?
 
 thank you



Re: charfilter doesn't do anything

2013-09-08 Thread Jack Krupansky
I tried this and it seems to work when added to the standard Solr example in 
4.4:


<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only the text between <body> and </body>. Is that what
you wanted?


Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H
'Content-type:application/json' -d '

[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
shows all data

curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
shows the body text

curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
shows nothing ("abc" is outside of <body>)

curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
shows nothing ("def" is outside of <body>)

curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
shows nothing; the HTML tag is stripped

In your original query, you didn't show us what your default field, df 
parameter, was.
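
If df doesn't point at the field the char filter was applied to, that alone
would explain missing matches; a quick sanity check (a sketch, reusing the
field from my example above):

curl "http://localhost:8983/solr/select/?q=test&df=body&indent=true&wt=json"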


-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Sunday, September 08, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

Yes, but that filters HTML and not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:


Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Not quite the body, perhaps, but might it help?


On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:


ok, I have HTML pages like <html>...<!--body-->content I
want<!--/body-->...</html>. I want to extract (index, store) only
what is between the body comments. I thought RegexTransformer would be the
best choice because XPath doesn't work in Tika and I can't nest an
XPathEntityProcessor to use XPath. What I have also found out is that the
HTML parser from Tika cuts my body comments out and tries to make well-formed
HTML, which I would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:


On 9/6/2013 7:09 AM, Andreas Owen wrote:

I've managed to get it working if I use the RegexTransformer and the string
is on the same line in my Tika entity. But when the string spans multiple lines
it isn't working, even though I tried (?s) to set the DOTALL flag.


<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
        dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
        transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
         replaceWith="QQQ" sourceColName="text" />
</entity>

Then I tried it like this and I get a StackOverflowError:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
       replaceWith="QQQ" sourceColName="text" />


In JavaScript this works, but maybe that's because I only used a small string.


Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?



http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory


Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input.  The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.



http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory


Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results.  If you need to affect what gets stored as well, perhaps you
need an Update Processor.

Thanks,
Shawn





Expunge deleting using excessive transient disk space

2013-09-08 Thread Manuel Le Normand
  Hi,
In order to delete part of my index I run a delete-by-query that intends to
erase 15% of the docs.
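
For reference, the delete-by-query is issued like this (the query here is just
a placeholder, not my real one):

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml' \
  -d '<delete><query>expired:true</query></delete>'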
I added these params to the solrconfig.xml:
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">2</int>
  <int name="maxMergeAtOnceExplicit">2</int>
  <double name="maxMergedSegmentMB">5000.0</double>
  <double name="reclaimDeletesWeight">10.0</double>
  <double name="segmentsPerTier">15.0</double>
</mergePolicy>

The extra params were added in order to promote merging of old segments, but
with a restriction on the transient disk space that can be used (as I have only
15GB per shard).

This procedure failed on a "no space left on device" exception, although
proper calculations show that these params should cause no usage in excess of
the transient free disk space I have.
Looking at the infostream I can see that the first merges do succeed, but
older segments are kept referenced and thus cannot be deleted until all the
merges are done.

Is there any way of overcoming this?


Profiling Solr Lucene for query

2013-09-08 Thread Manuel Le Normand
Hello all,
Looking at the 10% slowest queries, I get very bad performance (~60 sec
per query).
These queries have lots of conditions on my main field (more than a
hundred), including phrase queries, and rows=1000. I do return only ids,
though.
I can quite firmly say that this bad performance is due to a slow-storage
issue (beyond my control for now). Despite this I want to improve
my performance.

As taught in school, I started profiling these queries; the data of a ~1
minute profile is located here:
http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I'm waiting on readVInt, whose stack trace
(2 out of 2 thread dumps) is:

catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
  at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
  at org.apache.lucene.index.TermContext.build(TermContext.java:95)
  at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
  at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)


So I do actually wait for IO as expected, but I might be page-faulting too many
times while looking up the term blocks (the .tim file), i.e. locating the term.
As I am reindexing now, would it be useful to lower the termIndexInterval
(default 128)? As the FSTs (the .tip files) are small (a few tens to hundreds
of MB) and there are no memory contentions, could I lower this param to 8, for
example? The benefit of lowering the term index interval would be to force the
FST into memory (the JVM, thanks to the NRTCachingDirectory), as I do not
control the term dictionary file (OS caching loads an average of 6% of it).
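
If I go this route, I assume the knob is the termIndexInterval setting under
indexConfig in solrconfig.xml; a sketch, assuming the codec in use honors it:

<indexConfig>
  <!-- default 128; smaller values index more terms at the cost of index size (assumption) -->
  <termIndexInterval>8</termIndexInterval>
</indexConfig>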


General configs:
Solr 4.3
36 shards, each with a few million docs
These 36 servers (each server has 2 replicas) are running virtualized, with
16GB of memory each (4GB for the JVM, 12GB left for OS caching), consuming
260GB of disk mounted for the index files.


Solr suggest - How to define solr suggest as case insensitive

2013-09-08 Thread Mysurf Mail
My suggest (spellchecker) is returning case-sensitive answers. (I use it for
autocomplete: "dog" and "Dog" return different phrases.)

My suggest is defined as follows in solrconfig.xml:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str>  <!-- the indexed field to derive suggestions from -->
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
    <!-- <str name="sourceLocation">american-english</str> -->
  </lst>
</searchComponent>

<requestHandler
    class="org.apache.solr.handler.component.SearchHandler"
    name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

in schema

<field name="suggest" type="phrase_suggest" indexed="true"
       stored="true" required="false" multiValued="true"/>

and

<copyField source="Name" dest="suggest"/>

and

<fieldtype name="phrase_suggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^\p{L}\p{M}\p{N}\p{Cs}]*[\p{L}\p{M}\p{N}\p{Cs}\_]+:)|([^\p{L}\p{M}\p{N}\p{Cs}])+"
            replacement=" " replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>
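
One thing I noticed while writing this up: the field's analyzer lowercases at
index time, but I have not pointed the component at a query-side analyzer. A
sketch of what I assume would be needed (untested), added inside the
searchComponent above:

<str name="queryAnalyzerFieldType">phrase_suggest</str>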


Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-08 Thread Nutan
The error got resolved, thanks a lot, Sir. I had been trying for days to
resolve it.


On Fri, Sep 6, 2013 at 11:36 PM, Chris Hostetter-3 [via Lucene] 
ml-node+s472066n4088604...@n3.nabble.com wrote:


 : it shows type as undefined for dynamic field ignored_* , and I am using

 That means the running solr instance does not know anything about a
 dynamic field named ignored_* -- it doesn't exist.

 : but on the admin page it shows schema :

 the page showing the schema file just tells you what's on disk -- it has
 no way of knowing if you modified that file after starting up solr.

 ... Wait a minute ... i see your problem now...

 ...
 : </fields>
 : <dynamicField name="ignored_*" type="ignored" indexed="false" stored="true"
 :   multiValued="true"/>

 ...your <dynamicField/> declaration needs to be inside your <fields>
 block.
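
 That is, something like this (a sketch):

 <fields>
   <!-- ...your other field declarations... -->
   <dynamicField name="ignored_*" type="ignored" indexed="false" stored="true"
                 multiValued="true"/>
 </fields>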


 -Hoss








Re: Indexing pdf files - question.

2013-09-08 Thread Nutan Shinde
The error got resolved; the solution was that the <dynamicField/> must be
within the <fields> tag.


On Sun, Sep 8, 2013 at 3:31 AM, Furkan KAMACI furkankam...@gmail.com wrote:

 Could you show us logs you get when you start your web container?


 2013/9/4 Nutan Shinde nutanshinde1...@gmail.com

  My solrconfig.xml is:

  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.content">desc</str> <!-- to map this field of my table,
                                               which is defined as shown below
                                               in schema.xml -->
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">true</str>
    </lst>
  </requestHandler>

  <lib dir="../../extract" regex=".*\.jar" />
 
 
 
  Schema.xml:

  <fields>
    <field name="doc_id" type="integer" indexed="true" stored="true"
           multiValued="false"/>
    <field name="name" type="text" indexed="true" stored="true"
           multiValued="false"/>
    <field name="path" type="text" indexed="true" stored="true"
           multiValued="false"/>
    <field name="desc" type="text_split" indexed="true" stored="true"
           multiValued="false"/>
  </fields>

  <types>
    <fieldType name="string" class="solr.StrField" />
    <fieldType name="integer" class="solr.IntField" />
    <fieldType name="text" class="solr.TextField" />
  </types>

  <dynamicField name="*_i" type="integer" indexed="true" stored="true"/>

  <uniqueKey>doc_id</uniqueKey>
 
 
 
  I have created an extract directory, copied all required .jar and solr-cell
  jar files into it, and given its path in the <lib> tag in
  solrconfig.xml.
 
 
 
  When I try out this:

  curl "http://localhost:8080/solr/update/extract?literal.doc_id=1&commit=true"
  -F "myfile=@solr-word.pdf"   in Windows 7.
 
 
 
  I get "/solr/update/extract is not available" and sometimes I get an access
  denied error.
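
  (Note: <lib> paths are resolved relative to the core's instanceDir, so an
  absolute dir is a quick sanity check when the handler reports "not
  available"; the path below is only an example:)

  <lib dir="C:/solr/extract" regex=".*\.jar" />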
 
  I tried resolving it through the net, but in vain, as all the solutions are
  related to Linux; I'm working on Windows.

  Please help me and provide solutions related to Windows.
 
  I referred Apache_solr_4_Cookbook.
 
  Thanks a lot.
 
 



Re: Expunge deleting using excessive transient disk space

2013-09-08 Thread Erick Erickson
Right, but you should have at least as much free space as your total index
size, and I don't see the total index size (but I'm just glancing).

I'm not entirely sure you can precisely calculate the maximum free space
you have relative to the amount needed for merging; some of the people who
wrote that code can probably tell you more.

I'd _really_ try to get more disk space. The amount of engineer time spent
trying to tune this is way more expensive than a disk...

Best,
Erick


On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

   Hi,
 In order to delete part of my index I run a delete-by-query that intends to
 erase 15% of the docs.
 I added these params to the solrconfig.xml:
 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">2</int>
   <int name="maxMergeAtOnceExplicit">2</int>
   <double name="maxMergedSegmentMB">5000.0</double>
   <double name="reclaimDeletesWeight">10.0</double>
   <double name="segmentsPerTier">15.0</double>
 </mergePolicy>

 The extra params were added in order to promote merging of old segments, but
 with a restriction on the transient disk space that can be used (as I have
 only 15GB per shard).

 This procedure failed on a "no space left on device" exception, although
 proper calculations show that these params should cause no usage in excess of
 the transient free disk space I have.
 Looking at the infostream I can see that the first merges do succeed, but
 older segments are kept referenced and thus cannot be deleted until all the
 merges are done.

 Is there any way of overcoming this?



Dynamic Field

2013-09-08 Thread anurag.jain
Hi all,

I am using Solr dynamic fields. I am storing data in the following format:


id | batch_* | job_*


So for a doc, data is stored like:

------------------------------------------------
id | batch_21 | job_21 | job_22 | batch_22 | ...
------------------------------------------------
1  | 120      | 0      | 1      | 121      | ...
------------------------------------------------

Using the Luke request handler I found that currently there are more than 5k
fields and 300 docs, and the number of fields is always increasing because of
the dynamic fields.
So I am worried about Solr performance or any unknown issues which can come
up. If somebody has experienced this, please tell me, and please tell me the
correct way to handle these issues.

Are there any alternatives to dynamic fields? Can we store information like
below?

----------------------------------------
id | jobs        | batch
----------------------------------------
21 | {21:0,22:1} | {21:120,22:121}
----------------------------------------





Re: Some highlighted snippets aren't being returned

2013-09-08 Thread Eric O'Hanlon
Hi again Everyone,

I didn't get any replies to this, so I thought I'd re-send in case anyone 
missed it and has any thoughts.

Thanks,
Eric

On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

 Hi Everyone,
 
 I'm facing an issue in which my solr query is returning highlighted snippets 
 for some, but not all results.  For reference, I'm searching through an index 
 that contains web crawls of human-rights-related websites.  I'm running solr 
 as a webapp under Tomcat and I've included the query's solr params from the 
 Tomcat log:
 
 ...
 webapp=/solr-4.2
 path=/select
 params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true}
  hits=8 status=0 QTime=108
 ...
 
 For the query above (which can be simplified to say: find all documents that 
 contain the word unangan and return facets, highlights, etc.), I get five 
 search results.  Only three of these are returning highlighted snippets.  
 Here's the highlighting portion of the solr response (note: printed in ruby 
 notation because I'm receiving this response in a Rails app):
 
 
 "highlighting"=>
  {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>{}}
 
 
 I have eight (as opposed to five) results above because I'm also doing a 
 grouped query, grouping by a field called original_url, and this leads to 
 five grouped results.
 
 I've confirmed that my highlight-lacking results DO contain the word 
 unangan, as expected, and this term is appearing in a text field that's 
 indexed and stored, and being searched for all text searches.  For example, 
 one of the search results is for a crawl of this document: 
 http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf
 
 And if you view that document on the web, you'll see that it does contain 
 unangan.
 
 Has anyone seen this before?  And does anyone have any good suggestions for 
 troubleshooting/fixing the problem?
 
 Thanks!
 
 - Eric



Re: Some highlighted snippets aren't being returned

2013-09-08 Thread Bill Bell
Zip up all your configs 

Bill Bell
Sent from mobile


On Sep 8, 2013, at 3:00 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

 Hi again Everyone,
 
 I didn't get any replies to this, so I thought I'd re-send in case anyone 
 missed it and has any thoughts.
 
 Thanks,
 Eric
 
 On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:
 
 Hi Everyone,
 
 I'm facing an issue in which my solr query is returning highlighted snippets 
 for some, but not all results.  For reference, I'm searching through an 
 index that contains web crawls of human-rights-related websites.  I'm 
 running solr as a webapp under Tomcat and I've included the query's solr 
 params from the Tomcat log:
 
 ...
 webapp=/solr-4.2
 path=/select
 params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true}
  hits=8 status=0 QTime=108
 ...
 
 For the query above (which can be simplified to say: find all documents that 
 contain the word unangan and return facets, highlights, etc.), I get five 
 search results.  Only three of these are returning highlighted snippets.  
 Here's the highlighting portion of the solr response (note: printed in 
 ruby notation because I'm receiving this response in a Rails app):
 
 
 "highlighting"=>
  {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>{}}
 
 
 I have eight (as opposed to five) results above because I'm also doing a 
 grouped query, grouping by a field called original_url, and this leads to 
 five grouped results.
 
 I've confirmed that my highlight-lacking results DO contain the word 
 unangan, as expected, and this term is appearing in a text field that's 
 indexed and stored, and being searched for all text searches.  For example, 
 one of the search results is for a crawl of this document: 
 http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf
 
 And if you view that document on the web, you'll see that it does contain 
 unangan.
 
 Has anyone seen this before?  And does anyone have any good suggestions for 
 troubleshooting/fixing the problem?
 
 Thanks!
 
 - Eric
 


Re: Dynamic Field

2013-09-08 Thread Jack Krupansky

1. First, tell us how your application intends to use and query your data.
That will be a guide to how your data should be stored.
2. Flatten your data (a sketch of what that could look like follows below).
3. Use dynamic and multivalued fields only in moderation.
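
As a sketch of what flattening could look like here (field names are
illustrative only, not a recommendation for your exact app):

<field name="jobs" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="batches" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- one token per entry, e.g. jobs = ["21:0", "22:1"], batches = ["21:120", "22:121"] -->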


-- Jack Krupansky

-Original Message- 
From: anurag.jain

Sent: Sunday, September 08, 2013 3:49 PM
To: solr-user@lucene.apache.org
Subject: Dynamic Field

Hi all,

I am using Solr dynamic fields. I am storing data in the following format:


id | batch_* | job_*


So for a doc, data is stored like:

------------------------------------------------
id | batch_21 | job_21 | job_22 | batch_22 | ...
------------------------------------------------
1  | 120      | 0      | 1      | 121      | ...
------------------------------------------------

Using the Luke request handler I found that currently there are more than 5k
fields and 300 docs, and the number of fields is always increasing because of
the dynamic fields.
So I am worried about Solr performance or any unknown issues which can come
up. If somebody has experienced this, please tell me, and please tell me the
correct way to handle these issues.

Are there any alternatives to dynamic fields? Can we store information like
below?

----------------------------------------
id | jobs        | batch
----------------------------------------
21 | {21:0,22:1} | {21:120,22:121}
----------------------------------------






SOLR index Recovery availability

2013-09-08 Thread atuldj.jadhav
Hi Team, I need your suggestions/views on the approach I have in place for SOLR
availability and recovery.
I am running *SOLR 3.5* and have around *30k* documents indexed in my SOLR
core. I have configured SOLR to hold *5k* documents per segment.
I periodically commit & optimize my SOLR index.

I have delta indexing in place to index new documents in SOLR. /Very rarely/
I face an index corruption issue; to fix it I have a *checkindex -fix* job in
place as well. However, sometimes this job can delete the corrupt segment
(meaning a loss of 5k documents until I fully re-index SOLR).

_*I have a few follow-up questions on this case.*_
1. How can I avoid the loss of 5k documents (checkindex -fix)? Shall I reduce
the number of documents per segment? Is there an alternate solution?

2. If I start taking a periodic backup (snapshot) of the entire index, shall I
just replace my data/index folder from the backup folder in case corruption
is found? Is this a good implementation?
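
For question 2, I assume the replication handler's backup command is the
standard way to take such a snapshot (a sketch, assuming the handler is
configured and the port matches my setup):

curl "http://localhost:8983/solr/replication?command=backup"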

3. Any other good solution or suggestion to have maximum index availability
all the time?

Thanks in advance for giving your time. 

Atul 




Re: SOLR index Recovery availability

2013-09-08 Thread Walter Underwood
This sounds very complicated for only 30K documents. Put them all on one 
server, give it enough memory so that the index can all be in file buffers. If 
there is a disaster, reindex everything. That should only take a few minutes.

And don't optimize.

wunder

On Sep 8, 2013, at 3:01 PM, atuldj.jadhav wrote:

 Hi Team, I need your suggestions/views on the approach I have in place for SOLR
 availability and recovery.
 I am running *SOLR 3.5* and have around *30k* documents indexed in my SOLR
 core. I have configured SOLR to hold *5k* documents per segment.
 I periodically commit & optimize my SOLR index.
 
 I have delta indexing in place to index new documents in SOLR. /Very rarely/
 I face an index corruption issue; to fix it I have a *checkindex -fix* job in
 place as well. However, sometimes this job can delete the corrupt segment
 (meaning a loss of 5k documents until I fully re-index SOLR).
 
 _*I have a few follow-up questions on this case.*_
 1. How can I avoid the loss of 5k documents (checkindex -fix)? Shall I reduce
 the number of documents per segment? Is there an alternate solution?
 
 2. If I start taking a periodic backup (snapshot) of the entire index, shall I
 just replace my data/index folder from the backup folder in case corruption
 is found? Is this a good implementation?
 
 3. Any other good solution or suggestion to have maximum index availability
 all the time?
 
 Thanks in advance for giving your time.
 
 Atul
 
 
 

--
Walter Underwood
wun...@wunderwood.org





multiple update processor chains.

2013-09-08 Thread mike st. john
Is it possible to have multiple update processor chains run by default?

I've tried adding multiple update.chain entries for the UpdateRequestHandler,
but it didn't seem to work.


Wondering if it's even possible.



Thanks

msj


Data import

2013-09-08 Thread Luís Portela Afonso
Hi,

Is it possible to disable document updates when running the data import
full-import command?

Thanks



RE: Some highlighted snippets aren't being returned

2013-09-08 Thread Bryan Loofbourrow
Eric,

Your example document is quite long. Are you setting hl.maxAnalyzedChars?
If you don't, the highlighter you appear to be using will not look past
the first 51,200 characters of the document for snippet candidates.

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
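
A quick way to test (the value is arbitrary, just larger than your documents;
field names are taken from your query above):

curl "http://localhost:8983/solr/select?q=Unangan&hl=true&hl.fl=contents&hl.maxAnalyzedChars=1000000"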

-- Bryan


 -Original Message-
 From: Eric O'Hanlon [mailto:elo2...@columbia.edu]
 Sent: Sunday, September 08, 2013 2:01 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Some highlighted snippets aren't being returned

 Hi again Everyone,

 I didn't get any replies to this, so I thought I'd re-send in case
anyone
 missed it and has any thoughts.

 Thanks,
 Eric

 On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

  Hi Everyone,
 
  I'm facing an issue in which my solr query is returning highlighted
 snippets for some, but not all results.  For reference, I'm searching
 through an index that contains web crawls of human-rights-related
 websites.  I'm running solr as a webapp under Tomcat and I've included the
 query's solr params from the Tomcat log:
 
  ...
  webapp=/solr-4.2
  path=/select
 

params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108
  ...
 
 For the query above (which can be simplified to say: find all documents
 that contain the word "unangan" and return facets, highlights, etc.), I
 get five search results.  Only three of these are returning highlighted
 snippets.  Here's the highlighting portion of the solr response (note:
 printed in ruby notation because I'm receiving this response in a Rails
 app):
 
  
 "highlighting"=>
  {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
   "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
     {"contents"=>[...actual snippet is returned here...]},
   "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>{}}
  
 
 I have eight (as opposed to five) results above because I'm also doing a
 grouped query, grouping by a field called original_url, and this leads
 to five grouped results.
 
  I've confirmed that my highlight-lacking results DO contain the word
 unangan, as expected, and this term is appearing in a text field
that's
 indexed and stored, and being searched for all text searches.  For
 example, one of the search results is for a crawl of this document:

http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.p
 df
 
  And if you view that document on the web, you'll see that it does
 contain unangan.
 
 Has anyone seen this before?  And does anyone have any good suggestions
 for troubleshooting/fixing the problem?
 
  Thanks!
 
  - Eric


Re: Tweaking boosts for more search results variety

2013-09-08 Thread Sai Gadde
Sorry for the delayed response.

There are limitations in this scenario, where we have 5 million indexed
documents from only about 1000 sites: if results are grouped by site, we will
not be able to show more than a couple of pages for a lot of search keywords.


Ex: a search for "Solr" has 1000 matches, but only from 20 sites.
Of these 20 sites:
10 sites are of sitetype A - boost 5
7 sites are of sitetype B - boost 2
3 sites are of sitetype C - boost 1

Limitation 1: If these are grouped by site, only 20 results would be
displayed, in 2 pages (10 per page).

We still want to display all the results. For a better user experience,
ideally we would like to have 10 results on page 1 from 10 distinct
sites of sitetype A (which already has a higher boost), or, in a real-world
scenario, from 7-8 distinct sites. In our case we see something like 7 matches
on a page from a single site.

Limitation 2: Inverse document frequency (IDF) would have helped here, but in
that case our preferential boost for site types is ignored and some results
from sitetype C would come out on top due to the IDF boost.

What we want to achieve is any way to control variety of sites displayed in
search results with preferential boost still in place.
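
For reference, the grouping approach discussed below boils down to parameters
like the following (a sketch; "site" is our grouping field from the schema
above):

q=solr&defType=edismax&group=true&group.field=site&group.limit=1&rows=10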

Thanks in advance




On Sun, Sep 8, 2013 at 6:36 AM, Furkan KAMACI furkankam...@gmail.com wrote:

 What do you mean by *these limitations*? Do you want to do multiple
 groupings at the same time?


 2013/9/6 Sai Gadde gadde@gmail.com

  Thank you Jack for the suggestion.
 
  We can try grouping by site. But considering that the number of sites is only
  about 1000 against an index size of 5 million, one can expect most of the
  hits would be hidden, and for certain specific keywords only a handful of
  actual results could be displayed if results are grouped by site.
 
  we already group on a signature field to identify duplicate content in
  these 5 million+ docs. But here the number of duplicates are only about
  3-5% maximum.
 
  Is there any workaround for these limitations with grouping?
 
  Thanks
  Shyam
 
 
 
  On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   The grouping (field collapsing) feature somewhat addresses this - group
  by
   a site field and then if more than one or a few top pages are from
 the
   same site they get grouped or collapsed so that you can see more sites
  in a
   few results.
  
   See:
    http://wiki.apache.org/solr/FieldCollapsing
    https://cwiki.apache.org/confluence/display/solr/Result+Grouping
  
   -- Jack Krupansky
  
   -Original Message- From: Sai Gadde
   Sent: Thursday, September 05, 2013 2:27 AM
   To: solr-user@lucene.apache.org
   Subject: Tweaking boosts for more search results variety
  
  
   Our index is aggregated content from various sites on the web. We want
  good
   user experience by showing multiple sites in the search results. In our
   setup we are seeing most of the results from same site on the top.
  
   Here is some information regarding queries and schema
  site - String field. We have about 1000 sites in index
  sitetype - String field.  we have 3 site types
   omitNorms=true for both the fields
  
   Doc count varies largely based on site and sitetype by a factor of 10 -
   1000 times
   Total index size is about 5 million docs.
   Solr Version: 4.0
  
   In our queries we have a fixed and preferential boost for certain
 sites.
   sitetype has different and fixed boosts for 3 possible values. We
 turned
   off Inverse Document Frequency (IDF) for these boosts to work properly.
   Other text fields are boosted based on search keywords only.
  
   With this setup we often see a bunch of hits from a single site
 followed
  by
   next etc.,
   Is there any solution to see results from variety of sites and still
 keep
   the preferential boosts in place?
  
 



Re: Data import

2013-09-08 Thread Alexandre Rafalovitch
What do you specifically mean by "disable document update"? Do you mean
in-place updates? Or do you mean you want to run the import but not actually
populate the Solr collection with processed documents?

It might help to explain the business level goal you are trying to achieve.
Or, specific error that you are perhaps seeing and trying to avoid.

Regards,
   Alex.


Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Sep 9, 2013 at 6:42 AM, Luís Portela Afonso
meligalet...@gmail.com wrote:

 Hi,

 It's possible to disable document update when running data import,
 full-import command?

 Thanks


Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-08 Thread diyun2008
Thank you Erick. It's very useful to me. I have already started to merge logs
of collections into 15 collections. But there's another question: if I merge
1000 collections into 1 collection, the new collection will have about
20G of data and about 30M records. On 1 Solr server, I will create 15 such big
collections. So I don't know if Solr can support such big data in 1
collection (20G of data with 30M records) or in 1 Solr server (15*20G of data
with 15*30M records)? Or do I need to buy new servers to install Solr and do
sharding to support that?





Re: multiple update processor chains.

2013-09-08 Thread Alexandre Rafalovitch
Only one chain per handler. But then you can define any sequence inside the
chain, so why do you care about multiple chains?
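
For example, a single default chain can stack whatever processors you need
(a sketch using stock factories; RunUpdateProcessorFactory should come last):

<updateRequestProcessorChain name="mychain" default="true">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- your custom processors go here -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>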

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

 is it possible to have multiple run by default?

 i've tried adding multiple update.chains for the  UpdateRequestHandler but
 it didn't seem to work.


 wondering if its even possible.



 Thanks

 msj



Re: Loading a SpellCheck dynamically

2013-09-08 Thread Mr Havercamp

Hi, thanks for the response.

Per your instructions, I have set up additional request handlers for 
handling language-specific /selects:


<!-- generic query -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>

    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">3</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<!-- English-specific query -->
<requestHandler name="/select_en" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>

    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">3</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck_en</str>
  </arr>
</requestHandler>

While it may require additional setup, I think it works quite elegantly
and allows me to do more language-targeted queries in addition to spell
suggest.
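
For example, an English query now goes through the English handler (a sketch;
host and core path assumed):

curl "http://localhost:8983/solr/select_en?q=colour&wt=json&indent=true"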


Thanks again.

Cheers


Hayden

On 06/09/13 16:35, Shalin Shekhar Mangar wrote:

My guess is that you have a single request handler defined with all
your language specific spell check components. This is why you see
spellcheck values from all spellcheckers.

If the above is true, then I don't think there is a way to choose one
specific spellchecker component. The alternative is to define multiple
request handlers with one-to-one mapping with the spell check
components. Then you can send a request to one particular request
handler and the corresponding spell check component will return its
response.

On Thu, Sep 5, 2013 at 11:29 PM, Mr Havercamp mrhaverc...@gmail.com wrote:

I currently have multiple spellchecks configured in my solrconfig.xml to
handle a variety of different spell suggestions in different languages.

In the snippet below, I have a catch-all spellcheck as well as an English-only
one for more accurate matching (i.e. my schema.xml is set up to capture
English-only fields to an English-specific textSpell_en field, and then I
also capture to a generic textSpell field):

---solrconfig.xml---

<searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell_en</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

My question is: when I query my Solr index, am I able to load, say, just
spellcheck values from the spellcheck_en spellchecker rather than from both?
This would be useful if I were to start implementing additional language
spellchecks, e.g. spellcheck_ja, spellcheck_fr, etc.

Thanks for any insights.

Cheers


Hayden







Searching solr on school name during year

2013-09-08 Thread Rohit Kumar
Hi,

Currently I have a student search which allows me to search for documents
in a school. I am looking at including year search in the existing schema,
which would enable users to search for students in a school during a year.
I have a proposed change to the schema to add the year component to
facilitate this search.


Existing schema: (No year information currently)

<field name="id" type="string" indexed="true" stored="true" required="true"
       multiValued="false" />
<field name="name" type="text_general" indexed="true" stored="true" />
<field name="schoolName" type="text_general" indexed="true" stored="true"
       multiValued="true"/>

Current sample data:
name:Borris Mayers
schoolName:Canterbury University




New schema:

<field name="id" type="string" indexed="true" stored="true" required="true"
       multiValued="false" />
<field name="name" type="text_general" indexed="true" stored="true" />
<field name="schoolName" type="text_general" indexed="true" stored="true"
       multiValued="true"/>
<field name="schoolNameWithTermOriginal" type="string" indexed="false"
       stored="true" multiValued="true"/>


Sample data:

name:Borris Mayers
schoolName:Canterbury University, start_2001, year_2001, year_2002,
year_2003, year_2004, year_2005, end_2005
schoolNameWithTermOriginal:Canterbury University||2001-2005
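
For illustration, with this scheme a search for students at a school during
2003 would look something like this (a sketch):

q=schoolName:"Canterbury University" AND schoolName:year_2003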


Please suggest whether this is a correct approach or whether there is a better
way to do the same.
I am using Solr 4.3.


Thanks,
Rohit Kumar