Re: Some highlighted snippets aren't being returned

2013-09-09 Thread Aloke Ghoshal
Hi Eric,

As Bryan suggests, you should look at appropriately setting up the
fragSize & maxAnalyzedChars parameters for long documents.

One issue I find with your search request is that in trying to
highlight across three separate fields, you have added each of them as
a separate request param:
hl.fl=contents&hl.fl=title&hl.fl=original_url

The way to do it would be
(http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass
them as values to one comma (or space) separated field:
hl.fl=contents,title,original_url
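
For example (values are illustrative), combined with the fragSize /
maxAnalyzedChars settings Bryan mentioned:

http://localhost:8983/solr/select?q=unangan&hl=true
    &hl.fl=contents,title,original_url
    &hl.fragsize=600&hl.maxAnalyzedChars=1000000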

Regards,
Aloke

On 9/9/13, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote:
 Eric,

 Your example document is quite long. Are you setting hl.maxAnalyzedChars?
 If you don't, the highlighter you appear to be using will not look past
 the first 51,200 characters of the document for snippet candidates.

 http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars

 -- Bryan


 -Original Message-
 From: Eric O'Hanlon [mailto:elo2...@columbia.edu]
 Sent: Sunday, September 08, 2013 2:01 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Some highlighted snippets aren't being returned

 Hi again Everyone,

 I didn't get any replies to this, so I thought I'd re-send in case
 anyone
 missed it and has any thoughts.

 Thanks,
 Eric

 On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

  Hi Everyone,
 
  I'm facing an issue in which my solr query is returning highlighted
 snippets for some, but not all results.  For reference, I'm searching
 through an index that contains web crawls of human-rights-related
 websites.  I'm running solr as a webapp under Tomcat and I've included
 the
 query's solr params from the Tomcat log:
 
  ...
  webapp=/solr-4.2
  path=/select
 

 params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan
 &f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*
 &f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6
 &hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url
 &hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_
 &facet.field=mimetype_code&facet.field=geographic_focus__facet
 &facet.field=organization_based_in__facet&facet.field=organization_type__facet
 &facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600
 &f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1
 &hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby
 &f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10
 &f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6
 &q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108
  ...
 
  For the query above (which can be simplified to say: find all
 documents
 that contain the word unangan and return facets, highlights, etc.), I
 get five search results.  Only three of these are returning highlighted
 snippets.  Here's the highlighting portion of the solr response (note:
 printed in ruby notation because I'm receiving this response in a Rails
 app):
 
  
  "highlighting"=>
   {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
     {},
    "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
     {},
    "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
     {},
    "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>
       [...actual snippet is returned here...]},
    "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
     {"contents"=>
       [...actual snippet is returned here...]},
    "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
     {"contents"=>
       [...actual snippet is returned here...]},
    "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
     {"contents"=>
       [...actual snippet is returned here...]},
    "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>
     {}}
  
 
  I have eight (as opposed to five) results above because I'm also doing
 a
 grouped query, grouping by a field called original_url, and this leads
 to five grouped results.
 
  I've confirmed that my highlight-lacking results DO contain the word
 unangan, as expected, and this term is appearing in a text field
 that's
 indexed and stored, and being searched for all text searches.  For
 example, one of the search results is for a crawl of this document:

 http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.p
 df
 
  And if you view that document on the web, you'll see that it does
 contain unangan.
 
  Has anyone seen this before?  And does anyone have any good
 suggestions
 for troubleshooting/fixing the problem?
 
  Thanks!
 
  - Eric



Re: multiple update processor chains.

2013-09-09 Thread mike st. john
Alexandre,

it was set up with multiple processors and working fine. I just noticed in
the docs that it mentioned you could have multiple chains; it seemed to make
sense to have the ability to chain the defined processors in order without
the need to merge them into a single update processor definition.
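
A minimal solrconfig.xml sketch of that single-chain setup (the chain and
processor names are only illustrative):

<updateRequestProcessorChain name="mychain">
  <!-- processors run in the order they are listed -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">title</str>
    <str name="dest">title_sort</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- select the chain for this handler -->
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>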

thanks
msj


On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Only one chain per handler. But then you can define any sequence inside the
 chain, so why do you care about multiple chains?

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

  is it possible to have multiple run by default?
 
  i've tried adding multiple update.chains for the  UpdateRequestHandler
 but
  it didn't seem to work.
 
 
  wondering if its even possible.
 
 
 
  Thanks
 
  msj
 



Re: multiple update processor chains.

2013-09-09 Thread Alexandre Rafalovitch
Which section in the docs specifically? I thought it was multiple chains
per config file, but you had to choose your specific chain for individual
processors.

I might be wrong though.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Sep 9, 2013 at 1:51 PM, mike st. john mstj...@gmail.com wrote:

 Alexandre,

 it was setup with multiple processors and working fine.   I just noticed in
 the docs, it mentioned you could have multiple chains, it seemed to make
 sense to have the ability to chain the defined processors in order without
 the need to merge them into a single update processor definition.

 thanks
 msj


 On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch
 arafa...@gmail.comwrote:

  Only one chain per handler. But then you can define any sequence inside
 the
  chain, so why do you care about multiple chains?
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
  On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:
 
   is it possible to have multiple run by default?
  
   i've tried adding multiple update.chains for the  UpdateRequestHandler
  but
   it didn't seem to work.
  
  
   wondering if its even possible.
  
  
  
   Thanks
  
   msj
  
 



Ideal Server Environment

2013-09-09 Thread Raheel Hasan
Hi guyz,

I am trying to setup a LIVE environment for my project that uses Apache
Solr  along with PHP/MySQL...

The indexing is of heavy data (about many GBs)..

Please can someone recommend the best server for this?

Thanks a lot.


-- 
Regards,
Raheel Hasan


Re: Ideal Server Environment

2013-09-09 Thread Raheel Hasan
Also, I wonder if Solr will require High processor? High Memory or High
Storage?

1) For Indexing
2) For querying


On Mon, Sep 9, 2013 at 12:36 PM, Raheel Hasan raheelhasan@gmail.comwrote:

 Hi guyz,

 I am trying to setup a LIVE environment for my project that uses Apache
 Solr  along with PHP/MySQL...

 The indexing is of heavy data (about many GBs)..

 Please can someone recommend the best server for this?

 Thanks a lot.


 --
 Regards,
 Raheel Hasan




-- 
Regards,
Raheel Hasan


Re: multiple update processor chains.

2013-09-09 Thread mike st. john
You're correct, it's not specifically for the update.chain. My mistake.

thanks

msj


On Mon, Sep 9, 2013 at 3:34 AM, Alexandre Rafalovitch arafa...@gmail.comwrote:

 Which section in the docs specifically? I thought it was multiple chains
 per config file, but you had to choose your specific chain for individual
 processors.

 I might be wrong though.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Mon, Sep 9, 2013 at 1:51 PM, mike st. john mstj...@gmail.com wrote:

  Alexandre,
 
  it was setup with multiple processors and working fine.   I just noticed
 in
  the docs, it mentioned you could have multiple chains, it seemed to make
  sense to have the ability to chain the defined processors in order
 without
  the need to merge them into a single update processor definition.
 
  thanks
  msj
 
 
  On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch
  arafa...@gmail.comwrote:
 
   Only one chain per handler. But then you can define any sequence inside
  the
   chain, so why do you care about multiple chains?
  
   Regards,
  Alex.
  
   Personal website: http://www.outerthoughts.com/
   LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
   - Time is the quality of nature that keeps events from happening all at
   once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
  
  
   On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com
 wrote:
  
is it possible to have multiple run by default?
   
i've tried adding multiple update.chains for the
  UpdateRequestHandler
   but
it didn't seem to work.
   
   
wondering if its even possible.
   
   
   
Thanks
   
msj
   
  
 



Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-09 Thread Yago Riveiro
If you want to have more collections you need to configure the -Djute.maxbuffer
variable in both zookeeper and solr to override the default limit.

In zookeeper you can configure it in the zookeeper-env.sh file. On Solr, pass
the variable like the others.

Note: In both cases the configured value needs to be the same or bad things can
happen.
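
For example (the value is illustrative and must match on both sides):

# conf/zookeeper-env.sh on every ZooKeeper node
SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"

# Solr JVM, passed like any other system property
java -Djute.maxbuffer=4194304 -jar start.jar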

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, September 9, 2013 at 5:01 AM, diyun2008 wrote:

 Thank you Erick. It's very useful to me. I have already started to merge logs
 of collections to 15 collections. but there's another question. If I merge
 1000 collections to 1 collection, to the new collection it will have about
 20G data and about 30M records. In 1 solr server, I will create 15 such big
 collections. So I don't know if solr can support such big data in 1
 collection(20G data with 30M records) or in 1 solr server(15*20G data with
  15*30M records)? Or do I need to buy new servers to install solr and do
  sharding to support that? 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088802.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Dynamic Field

2013-09-09 Thread Alvaro Cabrerizo
Hi:

As you posted, a possibility could be to define the fields jobs and
batch as multivalued and use partial update
(http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/) to
add new values to those fields.
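
An illustrative atomic-update request along those lines (it assumes jobs and
batch are multiValued and all other fields are stored; the values are made up):

curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1", "jobs":{"add":"23:0"}, "batch":{"add":"23:122"}}]'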

Hope it helps.



On Sun, Sep 8, 2013 at 9:49 PM, anurag.jain anurag.k...@gmail.com wrote:

 Hi all,

 I am using solr dynamic fields. I am storing data in the following format:

 -----------------------
  id    batch_*    job_*
 -----------------------

 So for a doc, data is stored like:

 ------------------------------------------------------
  id    batch_21    job_21    job_22    batch_22    ...
 ------------------------------------------------------
  1     120         0         1         121         ...
 ------------------------------------------------------

 Using the luke request handler I found that there are currently more than 5k
 fields and 300 docs, and the number of fields keeps increasing because of the
 dynamic fields. So I am worried about solr performance or any unknown issues
 that could come up. If somebody has experienced this, please tell me the
 correct way to handle these issues.

 Are there any alternatives to dynamic fields? Can we store information like
 below?

 -------------------------------------------
  id    jobs           batch
 -------------------------------------------
  21    {21:0, 22:1}   {21:120, 22:121}
 -------------------------------------------



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Dynamic-Field-tp4088775.html
 Sent from the Solr - User mailing list archive at Nabble.com.



help regarding custom query which returns custom output

2013-09-09 Thread Rohan Thakur
hi all

I have a requirement: I have implemented fulltext search, autosuggestion and
spellcorrection functionality in solr, but they are all running on different
cores, so I have to call 3 different request handlers to get the results,
which adds unnecessary delay. I wanted to know whether there is any solution
where I call just one request URL and get all three results in a single json
response from solr.

thanx
regards
rohan


Re: Ideal Server Environment

2013-09-09 Thread Toke Eskildsen
On Mon, 2013-09-09 at 09:39 +0200, Raheel Hasan wrote:
 Also, I wonder if Solr will require High processor? High Memory or High
 Storage?
 
 1) For Indexing

* Processor
* Bulk read/write.

 2) For querying

* Processor only if you have complex queries
* Fast random I/O reads, which can be accomplished either by having
enough RAM to cache most or all of your index or by using SSDs.


Your question is much too generic to go into specific hardware. Read
https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
https://wiki.apache.org/lucene-java/ImproveSearchingSpeed
https://wiki.apache.org/solr/SolrPerformanceProblems
then build a test instance, measure and scale from there.

- Toke Eskildsen



Re: Ideal Server Environment

2013-09-09 Thread Raheel Hasan
ok thanks for the reply

Also, could you tell me if CentOS or Ubuntu will be better?


On Mon, Sep 9, 2013 at 3:17 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote:

 On Mon, 2013-09-09 at 09:39 +0200, Raheel Hasan wrote:
  Also, I wonder if Solr will require High processor? High Memory or High
  Storage?
 
  1) For Indexing

 * Processor
 * Bulk read/write.

  2) For querying

 * Processor only if you have complex queries
 * Fast random I/O reads, which can be accomplished either by having
 enough RAM to cache most or all of your index or by using SSDs.


 Your question is much too generic to go into specific hardware. Read
 https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
 https://wiki.apache.org/lucene-java/ImproveSearchingSpeed
 https://wiki.apache.org/solr/SolrPerformanceProblems
 then build a test instance, measure and scale from there.

 - Toke Eskildsen




-- 
Regards,
Raheel Hasan


Re: Profiling Solr Lucene for query

2013-09-09 Thread Dmitry Kan
are you querying your shards via a frontend solr? We have noticed, that
querying becomes much faster if results merging can be avoided.

Dmitry


On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

 Hello all
 Looking at the 10% slowest queries, I get very bad performance (~60 sec
 per query).
 These queries have lots of conditions on my main field (more than a
 hundred), including phrase queries and rows=1000. I do return only ids
 though.
 I can quite firmly say that this bad performance is due to a slow storage
 issue (that is beyond my control for now). Despite this I want to improve
 my performance.

 As taught in school, I started profiling these queries, and the data of a ~1
 minute profile is located here:
 http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

 Main observation: most of the time I do wait for readVInt, whose stacktrace
 (2 out of 2 thread dumps) is:

 catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
  at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
  at org.apache.lucene.index.TermContext.build(TermContext.java:95)
  at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
  at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)


 So I do actually wait for IO as expected, but I might be page faulting too
 many times while looking for the TermBlocks (tim file), i.e. locating the
 term.
 As I reindex now, would it be useful to lower the termInterval (default
 128)? As the FST (tip files) are small (a few 10-100 MB) there are no memory
 contentions, so could I lower this param to 8, for example? The benefit of
 lowering the term interval would be to force the FST into memory (JVM -
 thanks to the NRTCachingDirectory), as I do not control the term dictionary
 file (OS caching loads an average of 6% of it).


 General configs:
 solr 4.3
 36 shards, each has few million docs
 These 36 servers (each server has 2 replicas) are running virtual, 16GB
 memory each (4GB for JVM, 12GB remain for the OS caching),  consuming 260GB
 of disk mounted for the index files.



Facet Sort with non ASCII Characters

2013-09-09 Thread Sandro Zbinden
Dear solr users

Is there a plan to add support for alphabetical facet sorting with non ASCII 
Characters ?

Best regards Sandro



Sandro Zbinden
Software Engineer





Facet sort descending

2013-09-09 Thread Sandro Zbinden
Dear solr users

Is there a plan to add a descending sort order for facet queries ?
Best regards Sandro


Sandro Zbinden
Software Engineer





Re: More on topic of Meta-search/Federated Search with Solr

2013-09-09 Thread Jakub Skoczen
Hi Dan,

You might want to take a look at pazpar2 [1], an open-source, federated
search engine with first-class support for SOLR (with addition to standard
information retrieval protocols like Z39.50/SRU).

[1] http://www.indexdata.com/pazpar2


On Thu, Sep 5, 2013 at 9:55 PM, Paul Libbrecht p...@hoplahup.net wrote:

 Hello list,

 A student of a friend of mine made his masters on that topic, especially
 about federated ranking.

 I have copied his text here:

 http://direct.hoplahup.net/tmp/FederatedRanking-Koblischke-2009.pdf

 Feel free to contact me to contact Robert Koblischke for questions.

 Paul


 On 28 août 2013, at 20:35, Dan Davis wrote:

  On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote:
 
  Would you like to create something like
  http://knimbus.com
 
 
  I work at the National Library of Medicine.   We are moving our library
  catalog to a newer platform, and we will probably include articles.   The
  article's content and meta-data are available from a number of web-scale
  discovery services such as PRIMO, Summon, EBSCO's EDS, EBSCO's
 traditional
  API.   Most libraries use open source solutions to avoid the cost of
  purchasing an expensive enterprise search platform.   We are big; we
  already have a closed-source enterprise search engine (and our own home
  grown Entrez search used for PubMed).Since we can already do
 Federated
  Search with the above, I am evaluating the effort of adding such to
 Apache
  Solr.   Because NLM data is used in the open relevancy project, we
 actually
  have the relevancy decisions to decide whether we have done a good job of
  it.
 
  I obviously think it would be Fun to add Federated Search to Apache
 Solr.
 
   Standard disclosure - my opinions do not represent the opinions of NIH
   or NLM. Fun is no reason to spend tax-payer money. Enhancing Apache
   Solr would reduce the risk of putting all our eggs in one basket, and
   there may be some other relevant benefits.
 
  We do use Apache Solr here for more than one other project... so keep up
  the good work even if my working group decides to go with the
 closed-source
  solution.




-- 

Cheers,
Jakub


Re: Ideal Server Environment

2013-09-09 Thread Toke Eskildsen
On Mon, 2013-09-09 at 12:42 +0200, Raheel Hasan wrote:

 Also, could you tell me if CentOS or Ubuntu will be better?

You are asking for short answers to complex questions.

There is nothing inherent in Solr that favours one Linux installation
over another. CentOS is aimed at the enterprise, so I _guess_ that it
will be preferable if you have a sysadmin to handle the underlying
system for you. If you are to manage it yourself, I would recommend
Ubuntu as it is aimed at end-users.

- Toke Eskildsen



Re: Solr Cell Question

2013-09-09 Thread Jamie Johnson
Thanks Erick,  This is how I was doing it but when I saw the Solr Cell
stuff I figured I'd give it a go.  What I ended up doing is the following

ModifiableSolrParams params = indexer.index(artifact);

 params.add("fmap.content", "my_custom_field");

 params.add("extractFormat", "text");

 ContentStreamUpdateRequest up = new ContentStreamUpdateRequest(
    "/update/extract");

 up.setParams(params);

 FileStream f = new FileStream(new File());

 up.addContentStream(f);


On Fri, Sep 6, 2013 at 9:54 AM, Erick Erickson erickerick...@gmail.comwrote:

  It's always frustrating when someone replies with "Why not do it
  a completely different way?" But I will anyway :).

  There's no requirement at all that you send things to Solr to make
  Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ
 anyway, why not just parse on the client? This has the advantage
 of allowing you to offload the Tika processing from Solr which can
 be quite expensive. You can use the same Tika jars that come
 with Solr or download whatever version from the Tika project
 you want. That way, you can exercise much better control over
 what's done.

 Here's a skeletal program with indexing from a DB mixed in, but
 it shouldn't be hard at all to pull the DB parts out.

 http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
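
 A minimal client-side sketch of that approach (parse with Tika in the SolrJ
 client and send only the extracted body); the field names and file path are
 illustrative:

 import java.io.File;
 import java.io.FileInputStream;
 import java.io.InputStream;

 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.impl.HttpSolrServer;
 import org.apache.solr.common.SolrInputDocument;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.sax.BodyContentHandler;

 public class TikaClientIndexer {
   public static void main(String[] args) throws Exception {
     SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

     // Parse the file locally with Tika instead of posting it to /update/extract
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
     Metadata metadata = new Metadata();
     InputStream in = new FileInputStream(new File("doc.pdf"));
     try {
       parser.parse(in, handler, metadata, new ParseContext());
     } finally {
       in.close();
     }

     // Send only the extracted body to the field we care about
     SolrInputDocument doc = new SolrInputDocument();
     doc.addField("id", "doc-1");
     doc.addField("my_custom_field", handler.toString());
     solr.add(doc);
     solr.commit();
   }
 }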

 FWIW,
 Erick


 On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:

  Is it possible to configure solr cell to only extract and store the body
 of
  a document when indexing?  I'm currently doing the following which I
  thought would work
 
   ModifiableSolrParams params = new ModifiableSolrParams();

    params.set("defaultField", "content");

    params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");

    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest(
       "/update/extract");

    up.setParams(params);

    FileStream f = new FileStream(new File(..));

    up.addContentStream(f);

   up.setAction(ACTION.COMMIT, true, true);

   solrServer.request(up);


   But the result of content is as follows

   <arr name="content_mvtxt">
     <str/>
     <str>null</str>
     <str>ISO-8859-1</str>
     <str>text/plain; charset=ISO-8859-1</str>
     <str>Just a little test</str>
   </arr>


   What I had hoped for was just

   <arr name="content_mvtxt">
     <str>Just a little test</str>
   </arr>
 



SOLR 4 stopwords and token positions

2013-09-09 Thread Fermin Silva
Hi Everyone,

I'm migrating from SOLR 3.x to 4.x and I'm required to keep the results as
close as possible to what they were before.
So I'm running some tests and found some differences.

My query is: title_search_pt:(geladeira/refrigerador)
And the parsed query becomes: MultiPhraseQuery(title_search_pt:(refriger
geladeir) (refriger geladeir))

This is identical in both instances (3.x and 4.x) so that's not the problem.

My document is:
balcão refrigerado e geladeira frigorifica

Which, after analysis, becomes:
balca refriger geladeir frigorif

That is also identical in both versions, except for the token positions.
Notice how 'e' disappears, because of being a stopword.

In SOLR 3.x the positions are: 1, 2, 3, 4
In SOLR 4.x the positions are: 1, 2, 4, 5

Could that be the problem?

I've posted a question before here: phrase queries on punctuation
(http://stackoverflow.com/questions/15314460/solr-generates-phrase-queries-on-punctuation)
which I believe, together with the token position issue, is causing the
discrepancies.

I couldn't find any documentation/changelog about token positions with
stopwords; hell, I can barely google SOLR 4 specific things.
Can this be solved?

I wish I could fix the original StackOverflow issue (preventing phrase query
generation on punctuation), but I could live with fixing the token
position thing at least (remember that if things work as before, then I am
able to upgrade to 4.x).

Thank you in advance

PS: just in case I'm adding the schema (version=1.5) part:

<fieldtype name="text_pt" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII"
            replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" preserveOriginal="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false"
            words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII"
            replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true"
            synonyms="portugueseSynonyms.txt" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" preserveOriginal="1"
            catenateNumbers="0" catenateAll="0" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false"
            words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>


Re: How to Manage RAM Usage at Heavy Indexing

2013-09-09 Thread Furkan KAMACI
Is there anything that says something about that bug?


2013/8/28 Dan Davis dansm...@gmail.com

 This could be an operating systems problem rather than a Solr problem.
 CentOS 6.4 (linux kernel 2.6.32) may have some issues with page flushing
 and I would read-up up on that.
 The VM parameters can be tuned in /etc/sysctl.conf
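
 For example, the dirty-page thresholds can be lowered (illustrative values,
 not a recommendation for your hardware):

 # /etc/sysctl.conf
 vm.dirty_background_ratio = 5
 vm.dirty_ratio = 15
 # apply with: sysctl -p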


 On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:

  Hi Erick;
 
  I wanted to get a quick answer that's why I asked my question as that
 way.
 
  Error is as follows:
 
  INFO  - 2013-08-21 22:01:30.978;
  org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
  webapp=/solr path=/update params={wt=javabinversion=2}
  {add=[com.deviantart.reachmeh
  ere:http/gallery/, com.deviantart.reachstereo:http/,
  com.deviantart.reachstereo:http/art/SE-mods-313298903,
  com.deviantart.reachtheclouds:http/,
 com.deviantart.reachthegoddess:http/,
  co
  m.deviantart.reachthegoddess:http/art/retouched-160219962,
  com.deviantart.reachthegoddess:http/badges/,
  com.deviantart.reachthegoddess:http/favourites/,
  com.deviantart.reachthetop:http/
  art/Blue-Jean-Baby-82204657 (1444006227844530177),
  com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
  ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException;
  java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
  early EOF
  at
 
 
 com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
  at
 
 
 com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
  at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
  at
 
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
  at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
  at
 
 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  at
 
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
  at
 
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
  at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
  at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
  at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
  at
  org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
  at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
  at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at
 
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at
 
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:365)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
  at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at
 
 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at
 
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at
 
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:722)
  Caused by: org.eclipse.jetty.io.EofException: early EOF
  at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65)
  at java.io.InputStream.read(InputStream.java:101)
  at 

Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-09 Thread diyun2008
Thank you Yago. That seems somewhat strange. Do you know an official document
detailing this? I really need more evidence to make a decision. I mean I need
to compare the two methods and find out which has more advantages in terms of
performance and cost. And I will change my parameter to do more testing. I
have 15K collections at least. If you have more experience, I would very much
appreciate more advice from you.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088873.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Connection Established but waiting for response for a long time.

2013-09-09 Thread qungg
 <Set name="ThreadPool">
   <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
     <Set name="minThreads">10</Set>
     <Set name="maxThreads">1</Set>
     <Set name="detailedDump">false</Set>
   </New>
 </Set>

 <Call name="addConnector">
   <Arg>
     <New class="org.eclipse.jetty.server.bio.SocketConnector">
       <Set name="host"><SystemProperty name="jetty.host" /></Set>
       <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
       <Set name="maxIdleTime">5000</Set>
       <Set name="requestHeaderSize">65536</Set>
       <Set name="lowResourceMaxIdleTime">1500</Set>
       <Set name="statsOn">false</Set>
     </New>
   </Arg>
 </Call>


Everything is default except for:
<Set name="maxIdleTime">5000</Set>
<Set name="requestHeaderSize">65536</Set>

Thanks,
Qun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Connection-Established-but-waiting-for-response-for-a-long-time-tp4088587p4088874.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-09 Thread diyun2008
I just found this option -Djute.maxbuffer in the zookeeper admin document. But
it's listed as an Unsafe Option and I can't really tell what it means. Maybe it
will bring some stability problems? Does someone have real practical
experience using this parameter? I will have at least 15K collections,
or I will have to merge them down to a smaller number.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088878.html
Sent from the Solr - User mailing list archive at Nabble.com.


Stemming and protwords configuration

2013-09-09 Thread csicard.ext
Hi,

We have a Solr server using stemming:

<filter class="solr.SnowballPorterFilterFactory" language="French"
        protected="protwords.txt" />

I would like to query the French words "frais" and "fraise" separately, so I
put the word "fraise" in the protwords.txt file.

- When I query the word "fraise", no documents indexed with the word "frais"
are found.
- When I query the word "frais", I get documents indexed with the word
"fraise".

Is there a way to not match "fraise" documents in the second situation?

I hope this is clear. Thanks for your reply.

Christophe


_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou 
falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.



Re: collections api setting dataDir

2013-09-09 Thread Shawn Heisey
On 9/7/2013 2:25 PM, mike st. john wrote:
 yes the collections api ignored it,what i ended up doing, was just
 building out some fairness in regards to creating the cores and calling
 coreadmin to create the cores, seemed to work ok.   Only issue i'm having
 now, and i'm still investigating is subsequent queries are returning
 different counts.

Every time I have seen distributed queries return different counts on
different runs, it is because documents with the same value in the
UniqueKey field exist in more than one shard.  If you are letting
SolrCloud route your documents automatically, this shouldn't happen ...
but if you are using distrib=false or a router that doesn't do it
automatically, then it could.

The Collections API doesn't do the dataDir parameter.  I suspect this is
because you could pass an absolute path in, which would break things
because every core would be trying to use the same dataDir.  If you want
a directory other than ${instanceDir}/data for dataDir, then you will
need to create each core individually rather than use the Collections API.
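
For illustration, a per-core create with an explicit dataDir goes through the
CoreAdmin API rather than the Collections API (names and paths here are just
examples):

http://host:8983/solr/admin/cores?action=CREATE
    &name=mycoll_shard1_replica1&collection=mycoll&shard=shard1
    &instanceDir=mycoll_shard1_replica1&dataDir=/indexes/mycoll_shard1/data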

Java does have the capability to determine whether a path is relative or
absolute, but it is safer to just ignore that parameter, especially
given the fact that a single cloud is usually on many servers, and
there's no reason those servers can't be running wildly different
operating systems.  Half your cloud could be on a Linux/UNIX OS and half
of it could be on Windows.

I personally find it better to let the Collections API do its thing and
use the default.

Thanks,
Shawn



Re: collections api setting dataDir

2013-09-09 Thread mike st. john
hi,

i've sorted it all out. Basically a few replicas had failed and the
counts on those replicas were less than the leader's. I basically killed the
index on those replicas and let them recover.


Thanks for the  help.

msj


On Mon, Sep 9, 2013 at 11:08 AM, Shawn Heisey s...@elyograg.org wrote:

 On 9/7/2013 2:25 PM, mike st. john wrote:
  yes the collections api ignored it,what i ended up doing, was just
  building out some fairness in regards to creating the cores and calling
  coreadmin to create the cores, seemed to work ok.   Only issue i'm having
  now, and i'm still investigating is subsequent queries are returning
  different counts.

 Every time I have seen distributed queries return different counts on
 different runs, it is because documents with the same value in the
 UniqueKey field exist in more than one shard.  If you are letting
 SolrCloud route your documents automatically, this shouldn't happen ...
 but if you are using distrib=false or a router that doesn't do it
 automatically, then it could.

 The Collections API doesn't do the dataDir parameter.  I suspect this is
 because you could pass an absolute path in, which would break things
 because every core would be trying to use the same dataDir.  If you want
 a directory other than ${instanceDir}/data for dataDir, then you will
 need to create each core individually rather than use the Collections API.

 Java does have the capability to determine whether a path is relative or
 absolute, but it is safer to just ignore that parameter, especially
 given the fact that a single cloud is usually on many servers, and
 there's no reason those servers can't be running wildly different
 operating systems.  Half your cloud could be on a Linux/UNIX OS and half
 of it could be on Windows.

 I personally find it better to let the Collections API do its thing and
 use the default.

 Thanks,
 Shawn




Re: Data import

2013-09-09 Thread Luís Portela Afonso
When I run dataimport?command=full-import&clean=false, solr adds new
documents with the information. But if the same information already exists
with the same uniqueKey, it replaces the existing document with a new one.
It does not update the document, it creates a new one. Is that possible?

I'm indexing rss feeds. I ran the rss example that ships with the solr
examples, and it does that.

On Sep 9, 2013, at 4:10 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 What do you specifically mean by the disable document update? Do you mean
 in-place update? Or do you mean you want to run the import but not actually
 populate Solr collection with processed documents?
 
 It might help to explain the business level goal you are trying to achieve.
 Or, specific error that you are perhaps seeing and trying to avoid.
 
 Regards,
   Alex.
 
 
 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
 On Mon, Sep 9, 2013 at 6:42 AM, Luís Portela Afonso
 meligalet...@gmail.comwrote:
 
 Hi,
 
 It's possible to disable document update when running data import,
 full-import command?
 
 Thanks





Re: How to Manage RAM Usage at Heavy Indexing

2013-09-09 Thread Shawn Heisey

On 9/9/2013 10:35 AM, P Williams wrote:

Is it odd that my index is ~16GB but top shows 30GB in virtual memory?
  Would the extra be for the field and filter caches I've increased in size?


This should probably be a new thread, but it might have some 
applicability here, so I'm replying.


I have noticed some inconsistencies in memory reporting on Linux with 
regard to Solr.  Here's a screenshot of top on one of my production 
systems, sorted by memory:


https://www.dropbox.com/s/ylxm0qlcegithzc/prod-top-sort-mem.png

The virtual memory size for the top process is right in line with my 
index size, plus a few gig for the java heap.  Something to note as you 
ponder these numbers: My java heap is only 6GB.  Java has allocated the 
entire 6GB.  The other two java processes are homegrown Solr-related 
applications.


What's odd is the resident and shared memory sizes.  I have pretty much 
convinced myself that the shared memory size is misreported.  If you add 
up the numbers for cached and free, you get a total of 53659264 ... 
about 11GB shy of the 64GB total memory.


If the reported resident memory for the Solr java process (17GB) were 
accurate, it would exceed total physical memory by several gigabytes, 
and there would be swap in use, but as you can see, there is no swap in use.


Recently I overheard a conversation between Lucene committers in a 
lucene IRC channel that seemed to be discussing this phenomenon.  There 
is apparently some issue with certain mmap modes that result in the 
operating system shared memory number going up even though no actual 
memory is being consumed.


Thanks,
Shawn



Re: Searching solr on school name during year

2013-09-09 Thread tamanjit.bin...@yahoo.co.in
You could either add two separate fields, one for start year and another for
end year, and then use range queries to include all docs.
e.g. Name - Boris
     start year - 2001
     end year - 2005


Or you could just have one field and put in multivalued years a student has
attended the school.
e.g. name - Boris
     year - 2001
            2002
            2003
            2004
            2005

I think the second approach would meet your objective; see the example
queries below.
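
Illustrative queries for the two layouts (field names are just examples):

  Two fields:   q=name:Boris AND start_year:[* TO 2003] AND end_year:[2003 TO *]
  Multivalued:  q=name:Boris AND year:2003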



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-solr-on-school-name-during-year-tp4088817p4088910.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data import

2013-09-09 Thread tamanjit.bin...@yahoo.co.in
Any form of indexing would always replace a document and never update it.
If you don't want replacements, don't use a unique key in your schema and sort
on time/date etc.

But I still don't get one thing: if I have two indexes that I try to merge
and both the indexes have some documents with the same unique ids, they don't
overwrite each other. Instead what I have is two documents with the same
unique id. Why does this happen? Anyone any clues?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-import-tp4088789p4088921.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr suggest - How to define solr suggest as case insensitive

2013-09-09 Thread tamanjit.bin...@yahoo.co.in
This is probably because your dictionary is made up of all lower case tokens,
but when you query the spell-checker the same analysis doesn't happen. The
ideal case would be to send lower case queries when you query the spellchecker.
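
One common way to get matching analysis on both sides is to build the
dictionary from a field type that lower-cases, and point the spellcheck
component's queryAnalyzerFieldType at the same type. A minimal sketch of such
a type (the name is illustrative):

<fieldType name="suggest_text" class="solr.TextField">
  <analyzer>
    <!-- keep whole terms, just lower-case them at index and query time -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>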



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-suggest-How-to-define-solr-suggest-as-case-insensitive-tp4088764p4088918.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr suggestion -

2013-09-09 Thread tamanjit.bin...@yahoo.co.in
Don't do any analysis on the field you are using for suggestion. What is
happening here is that at query time and indexing time the tokens are being
broken on white space. So effectively, "at" is being taken as one token and
"l" is being taken as another token, for which you get two different
suggestions.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-suggestion-tp4087841p4088919.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to Manage RAM Usage at Heavy Indexing

2013-09-09 Thread P Williams
Hi,

I've been seeing the same thing on CentOS with high physical memory use
with low JVM-Memory use.  I came to the conclusion that this was expected
behaviour.  Using top I noticed that my solr user's java process has
Virtual memory allocated of about twice the size of the index, actual is
within the limits I set when jetty starts.  I infer from this that 98% of
Physical Memory is being used to cache the index.  Walter, Erick and others
are constantly reminding people on list to have RAM the size of the index
available -- I think 98% physical memory use is exactly why.  Here is an
excerpt from Uwe Schindler's well written
piecehttp://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.htmlwhich
explains in greater detail:

Basically mmap does the same like handling the Lucene index as a swap
file. The mmap() syscall tells the O/S kernel to virtually map our whole
index files into the previously described virtual address space, and make
them look like RAM available to our Lucene process. We can then access our
index file on disk just like it would be a large byte[] array (in Java this
is encapsulated by a ByteBuffer interface to make it safe for use by Java
code). If we access this virtual address space from the Lucene code we
don’t need to do any syscalls, the processor’s MMU and TLB handles all the
mapping for us. If the data is only on disk, the MMU will cause an
interrupt and the O/S kernel will load the data into file system cache. If
it is already in cache, MMU/TLB map it directly to the physical memory in
file system cache. It is now just a native memory access, nothing more! We
don’t have to take care of paging in/out of buffers, all this is managed by
the O/S kernel. Furthermore, we have no concurrency issue, the only
overhead over a standard byte[] array is some wrapping caused by
Java’s ByteBuffer
interface (it is still slower than a real byte[] array, but that is the
only way to use mmap from Java and is much faster than all other directory
implementations shipped with Lucene). We also waste no physical memory, as
we operate directly on the O/S cache, avoiding all Java GC issues described
before.

Is it odd that my index is ~16GB but top shows 30GB in virtual memory?
 Would the extra be for the field and filter caches I've increased in size?

I went through a few Java tuning steps relating to OutOfMemoryErrors when
using DataImportHandler with Solr.  The first thing is that when using the
FileEntityProcessor for each file in the file system to be indexed an entry
is made and stored in heap before any indexing actually occurs.  When I
started pointing this at very large directories I started running out of
heap.  One work-around is to divide the job up into smaller batches, but I
was able to allocate more memory so that everything fit.  The next thing is
that with more memory allocated the limiting factor was too many open
files.  After allowing the solr user to open more files I was able to get
past this as well.  There was a sweet spot where indexing with just enough
memory was slow enough that I didn't experience the too many open files
error but why go slow?  Now I'm able to index ~4M documents (newspaper
articles and fulltext monographs) in about 7 hours.
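
For reference, raising the open-files limit for the solr user is typically
done along these lines (illustrative values, assuming a pam_limits setup):

# /etc/security/limits.conf
solr  soft  nofile  65536
solr  hard  nofile  65536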

I hope someone will correct me if I'm wrong about anything I've said here
and especially if there is a better way to do things.

Best of luck,
Tricia



On Wed, Aug 28, 2013 at 12:12 PM, Dan Davis dansm...@gmail.com wrote:

 This could be an operating systems problem rather than a Solr problem.
 CentOS 6.4 (linux kernel 2.6.32) may have some issues with page flushing
 and I would read-up up on that.
 The VM parameters can be tuned in /etc/sysctl.conf


 On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:

  Hi Erick;
 
  I wanted to get a quick answer that's why I asked my question as that
 way.
 
  Error is as follows:
 
  INFO  - 2013-08-21 22:01:30.978;
  org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
  webapp=/solr path=/update params={wt=javabinversion=2}
  {add=[com.deviantart.reachmeh
  ere:http/gallery/, com.deviantart.reachstereo:http/,
  com.deviantart.reachstereo:http/art/SE-mods-313298903,
  com.deviantart.reachtheclouds:http/,
 com.deviantart.reachthegoddess:http/,
  co
  m.deviantart.reachthegoddess:http/art/retouched-160219962,
  com.deviantart.reachthegoddess:http/badges/,
  com.deviantart.reachthegoddess:http/favourites/,
  com.deviantart.reachthetop:http/
  art/Blue-Jean-Baby-82204657 (1444006227844530177),
  com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
  ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException;
  java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
  early EOF
  at
 
 
 com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
  at
 
 
 

Re: Profiling Solr Lucene for query

2013-09-09 Thread Manuel Le Normand
Hi Dmitry,

I have solr 4.3 and every query is distributed and merged back for ranking
purpose.

What do you mean by frontend solr?


On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote:

 are you querying your shards via a frontend solr? We have noticed, that
 querying becomes much faster if results merging can be avoided.

 Dmitry


 On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

  Hello all
  Looking on the 10% slowest queries, I get very bad performances (~60 sec
  per query).
  These queries have lots of conditions on my main field (more than a
  hundred), including phrase queries and rows=1000. I do return only id's
  though.
  I can quite firmly say that this bad performance is due to slow storage
  issue (that are beyond my control for now). Despite this I want to
 improve
  my performances.
 
  As tought in school, I started profiling these queries and the data of ~1
  minute profile is located here:
  http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
 
  Main observation: most of the time I do wait for readVInt, who's
 stacktrace
  (2 out of 2 thread dumps) is:
 
  catalina-exec-3870 - Thread t@6615
   java.lang.Thread.State: RUNNABLE
   at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
   at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
   at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
   at org.apache.lucene.index.TermContext.build(TermContext.java:95)
   at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
   at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
   at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
   at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
   at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
   at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
   at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
   at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
   at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
 
 
  So I do actually wait for IO as expected, but I might be too many time
 page
  faulting while looking for the TermBlocks (tim file), ie locating the
 term.
  As I reindex now, would it be useful lowering down the termInterval
  (default to 128)? As the FST (tip files) are that small (few 10-100 MB)
 so
  there are no memory contentions, could I lower down this param to 8 for
  example? The benefit from lowering down the term interval would be to
  obligate the FST to get on memory (JVM - thanks to the
 NRTCachingDirectory)
  as I do not control the term dictionary file (OS caching, loads an
 average
  of 6% of it).
 
 
  General configs:
  solr 4.3
  36 shards, each has few million docs
  These 36 servers (each server has 2 replicas) are running virtual, 16GB
  memory each (4GB for JVM, 12GB remain for the OS caching),  consuming
 260GB
  of disk mounted for the index files.
 



Re: Expunge deleting using excessive transient disk space

2013-09-09 Thread Manuel Le Normand
I can only agree for the 50% free space recommendation. Unfortunately I do
not have this for the current time, I'm standing on a 10% free disk (out of
300GB for each server). I'm aware it is very low.

Does it seem reasonable to adapt the current merge policy (or write a new one)
so that it frees up the transient disk space after every merge instead of
waiting for all of them to finish? Where can I get such an answer (from the
people who wrote the code)?

Thanks


On Sun, Sep 8, 2013 at 9:30 PM, Erick Erickson erickerick...@gmail.comwrote:

 Right, but you should have at least as much free space as your total index
 size, and I don't see the total index size (but I'm just glancing).

 I'm not entirely sure you can precisely calculate the maximum free space
 you have relative to the amount needed for merging, some of the people who
 wrote that code can probably tell you more.

 I'd _really_ try to get more disk space. The amount of engineer time spent
 trying to tune this is way more expensive than a disk...

 Best,
 Erick


 On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:

Hi,
  In order to delete part of my index I run a delete by query that intends
 to
  erase 15% of the docs.
  I added this params to the solrconfig.xml
  mergePolicy class=org.apache.lucene.index.TieredMergePolicy
 int name=maxMergeAtOnce2/int
 int name=maxMergeAtOnceExplicit2/int
 double name=maxMergedSegmentMB5000.0/double
 double name=reclaimDeletesWeight10.0/double
 double name=segmentsPerTier15.0/double
  /mergePolicy
 
  The extra params were added in order to promote merge of old segments but
  with restriction on the transient disk that can be used (as I have only
  15GB per shard).
 
  This procedure failed on a no space left on device exception, although
  proper calculations show that these params should cause no usage excess
 of
  the transient free disk space I have.
   Looking on the infostream I can see that the first merges do succeed but
  older segments are kept in reference thus cannot be deleted until all the
  merging are done.
 
  Is there anyway of overcoming this?
 



Re: Facet Sort with non ASCII Characters

2013-09-09 Thread Yonik Seeley
On Mon, Sep 9, 2013 at 7:16 AM, Sandro Zbinden zbin...@imagic.ch wrote:
 Is there a plan to add support for alphabetical facet sorting with non ASCII 
 Characters ?

The entire unicode range should already work.  Can you give an example
of what you would like to see?

-Yonik
http://lucidworks.com


Re: Expunge deleting using excessive transient disk space

2013-09-09 Thread Walter Underwood
10% free space is guaranteed to cause problems. That is a faulty installation.

Explain to ops that Solr needs double the minimum index size. This is required 
for normal operation. That isn't extra, it is required for merges. Solr makes 
copies instead of doing record locking. The merge design is essential for speed.

If they don't provide that, it will break, and it will be their fault.

If they don't want to provide that, they need a different search engine.

Adapting the merge policy to work with only 10% free space is not reasonable. 
When one segment is bigger than 10% (and it will be), merging that segment will 
fail.

wunder

On Sep 9, 2013, at 12:24 PM, Manuel Le Normand wrote:

 I can only agree for the 50% free space recommendation. Unfortunately I do
 not have this for the current time, I'm standing on a 10% free disk (out of
 300GB for each server). I'm aware it is very low.
 
 Does this seem reasonable adapting the current merge policy (or writing a
 new one) that would free up the transient disk space every merge instead of
 waiting for all of them to achieve? Where can I get such a answer (people
 who wrote the code)?
 
 Thanks
 
 
 On Sun, Sep 8, 2013 at 9:30 PM, Erick Erickson erickerick...@gmail.comwrote:
 
 Right, but you should have at least as much free space as your total index
 size, and I don't see the total index size (but I'm just glancing).
 
 I'm not entirely sure you can precisely calculate the maximum free space
 you have relative to the amount needed for merging, some of the people who
 wrote that code can probably tell you more.
 
 I'd _really_ try to get more disk space. The amount of engineer time spent
 trying to tune this is way more expensive than a disk...
 
 Best,
 Erick
 
 
 On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand 
 manuel.lenorm...@gmail.com wrote:
 
  Hi,
 In order to delete part of my index I run a delete by query that intends
 to
 erase 15% of the docs.
 I added this params to the solrconfig.xml
 mergePolicy class=org.apache.lucene.index.TieredMergePolicy
   int name=maxMergeAtOnce2/int
   int name=maxMergeAtOnceExplicit2/int
   double name=maxMergedSegmentMB5000.0/double
   double name=reclaimDeletesWeight10.0/double
   double name=segmentsPerTier15.0/double
 /mergePolicy
 
 The extra params were added in order to promote merge of old segments but
 with restriction on the transient disk that can be used (as I have only
 15GB per shard).
 
 This procedure failed on a no space left on device exception, although
 proper calculations show that these params should cause no usage excess
 of
 the transient free disk space I have.
 Looking on the infostream I can see that the first merges do succeed but
 older segments are kept in reference thus cannot be deleted until all the
 merging are done.
 
 Is there anyway of overcoming this?
 
 

--
Walter Underwood
wun...@wunderwood.org





Re: Profiling Solr Lucene for query

2013-09-09 Thread Mikhail Khludnev
Hello Manuel,

One minute of sampling yields too little data. Lowering the term index interval should help;
however, I don't know how the FST-based term dictionary really behaves with it. It definitely
helped in 3.x.
Would you mind telling me which OS you are on and which Directory
implementation is actually used?
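
For reference, a minimal sketch of where that setting would go in
solrconfig.xml (the value 8 is simply the figure floated in the question, and
whether the default BlockTree-based postings format pays attention to it is
exactly the open question here):

<indexConfig>
  <termIndexInterval>8</termIndexInterval>
</indexConfig>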


On Sun, Sep 8, 2013 at 7:56 PM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

 Hello all
 Looking on the 10% slowest queries, I get very bad performances (~60 sec
 per query).
 These queries have lots of conditions on my main field (more than a
 hundred), including phrase queries and rows=1000. I do return only id's
 though.
 I can quite firmly say that this bad performance is due to slow storage
 issue (that are beyond my control for now). Despite this I want to improve
 my performances.

 As tought in school, I started profiling these queries and the data of ~1
 minute profile is located here:
 http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

 Main observation: most of the time I do wait for readVInt, who's stacktrace
 (2 out of 2 thread dumps) is:

 catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
  at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
  at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
  at org.apache.lucene.index.TermContext.build(TermContext.java:95)
  at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
  at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
  at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
  at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)


 So I do actually wait for IO as expected, but I might be too many time page
 faulting while looking for the TermBlocks (tim file), ie locating the term.
 As I reindex now, would it be useful lowering down the termInterval
 (default to 128)? As the FST (tip files) are that small (few 10-100 MB) so
 there are no memory contentions, could I lower down this param to 8 for
 example? The benefit from lowering down the term interval would be to
 obligate the FST to get on memory (JVM - thanks to the NRTCachingDirectory)
 as I do not control the term dictionary file (OS caching, loads an average
 of 6% of it).


 General configs:
 solr 4.3
 36 shards, each has few million docs
 These 36 servers (each server has 2 replicas) are running virtual, 16GB
 memory each (4GB for JVM, 12GB remain for the OS caching),  consuming 260GB
 of disk mounted for the index files.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: charfilter doesn't do anything

2013-09-09 Thread Jack Krupansky

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Monday, September 09, 2013 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines and not just a string with the 
body-tag.
it doesn't work with proper html files, even though i took all the new lines 
out.


html-file:
htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html

solr update debug output:
text_html: [html\r\n\r\nmeta name=\Content-Encoding\ 
content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; 
charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das 
will ich sehenfooter-content/body/html]




On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

I tried this and it seems to work when added to the standard Solr example 
in 4.4:


<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only text between <body> and </body>. Is that
what you wanted?


Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
Shows all data

curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
shows the body text

curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
shows nothing (outside of body)

curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
shows nothing (outside of body)

curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
Shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, df 
parameter, was.


-- Jack Krupansky

-Original Message- From: Andreas Owen
Sent: Sunday, September 08, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

yes but that filter html and not the specific tag i want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:


Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Not quite the body, perhaps, but might it help?


On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:


ok i have html pages with html.!--body--content i
want!--/body--./html. i want to extract (index, store) only
that between the body-comments. i thought regexTransformer would be the
best because xpath doesn't work in tika and i cant nest a
xpathEntetyProcessor to use xpath. what i have also found out is that 
the

htmlparser from tika cuts my body-comments out and tries to make well
formed html, which i would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:


On 9/6/2013 7:09 AM, Andreas Owen wrote:
i've managed to get it working if i use the regexTransformer and 
string
is on the same line in my tika entity. but when the string is multilined 
it

isn't working even though i tried ?s to set the flag dotall.


entity name=tika processor=TikaEntityProcessor url=${rec.url}

dataSource=dataUrl onError=skip htmlMapper=identity format=html
transformer=RegexTransformer

   field column=text_html regex=lt;bodygt;(.+)lt;/bodygt;

replaceWith=QQQ sourceColName=text  /

/entity

then i tried it like this and i get a stackoverflow

field column=text_html 
regex=lt;bodygt;((.|\n|\r)+)lt;/bodygt;

replaceWith=QQQ sourceColName=text  /


in javascript this works but maybe because i only used a small string.


Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?



http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory


Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input.  The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.



http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory


Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results.  If you need to affect what gets stored as well, perhaps you
need an Update Processor.

Thanks,
Shawn






eDismax Phrase Field Boosts on Single Terms

2013-09-09 Thread Jeff Porter
I am curious how the dismax parser handles single-term queries and phrase 
boosts.  For example, if I had a query

q=bars   with the following dismax parameters:  qf=categories and 
pf=categories^100  

I would expect that the parser would match on the QF parameter but then also 
match again on the PF parameter and apply the boost.  I am not seeing this.  
Should I be?  


The reason I was trying to avoid applying both a QF and a PF boost is that I 
do want to boost on values like "Bars and Restaurants", and a PF boost makes 
the most sense here rather than boosting any document that contains "Bar" or 
"Restaurant".

Thanks in advance.

Jeff




Re: charfilter doesn't do anything

2013-09-09 Thread Andreas Owen
I index HTML pages with a lot of lines, not just a string with the body tag. 
It doesn't work with proper HTML files, even though I took all the new lines 
out.

html-file:
<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:
text_html: [html\r\n\r\nmeta name=\Content-Encoding\ 
content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; 
charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das will 
ich sehenfooter-content/body/html]



On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

 I tried this and it seems to work when added to the standard Solr example in 
 4.4:
 
 field name=body type=text_html_body indexed=true stored=true /
 
 fieldType name=text_html_body class=solr.TextField 
 positionIncrementGap=100 
 analyzer
   charFilter class=solr.PatternReplaceCharFilterFactory 
 pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 /
   tokenizer class=solr.StandardTokenizerFactory/
   filter class=solr.LowerCaseFilterFactory/
 /analyzer
 /fieldType
 
 That char filter retains only text between body and /body. Is that what 
 you wanted?
 
 Indexing this data:
 
 curl 'localhost:8983/solr/update?commit=true' -H 
 'Content-type:application/json' -d '
 [{id:doc-1,body:abc bodyA test./body def}]'
 
 And querying with these commands:
 
 curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json;
 Shows all data
 
 curl http://localhost:8983/solr/select/?q=body:testindent=truewt=json;
 shows the body text
 
 curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json;
 shows nothing (outside of body)
 
 curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json;
 shows nothing (outside of body)
 
 curl http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json;
 Shows nothing, HTML tag stripped
 
 In your original query, you didn't show us what your default field, df 
 parameter, was.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Sunday, September 08, 2013 5:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 yes but that filter html and not the specific tag i want.
 
 On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
 
 Hmmm, have you looked at:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Not quite the body, perhaps, but might it help?
 
 
 On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:
 
 ok i have html pages with html.!--body--content i
 want!--/body--./html. i want to extract (index, store) only
 that between the body-comments. i thought regexTransformer would be the
 best because xpath doesn't work in tika and i cant nest a
 xpathEntetyProcessor to use xpath. what i have also found out is that the
 htmlparser from tika cuts my body-comments out and tries to make well
 formed html, which i would like to switch off.
 
 On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
 
 On 9/6/2013 7:09 AM, Andreas Owen wrote:
 i've managed to get it working if i use the regexTransformer and string
 is on the same line in my tika entity. but when the string is multilined it
 isn't working even though i tried ?s to set the flag dotall.
 
 entity name=tika processor=TikaEntityProcessor url=${rec.url}
 dataSource=dataUrl onError=skip htmlMapper=identity format=html
 transformer=RegexTransformer
field column=text_html regex=lt;bodygt;(.+)lt;/bodygt;
 replaceWith=QQQ sourceColName=text  /
 /entity
 
 then i tried it like this and i get a stackoverflow
 
 field column=text_html regex=lt;bodygt;((.|\n|\r)+)lt;/bodygt;
 replaceWith=QQQ sourceColName=text  /
 
 in javascript this works but maybe because i only used a small string.
 
 Sounds like we've got an XY problem here.
 
 http://people.apache.org/~hossman/#xyproblem
 
 How about you tell us *exactly* what you'd actually like to have happen
 and then we can find a solution for you?
 
 It sounds a little bit like you're interested in stripping all the HTML
 tags out.  Perhaps the HTMLStripCharFilter?
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Something that I already said: By using the KeywordTokenizer, you won't
 be able to search for individual words on your HTML input.  The entire
 input string is treated as a single token, and therefore ONLY exact
 entire-field matches (or certain wildcard matches) will be possible.
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
 
 Note that no matter what you do to your data with the analysis chain,
 Solr will always return the text that was originally indexed in search
 results.  If you need to affect what gets stored as well, perhaps you
 need an Update Processor.
 
 Thanks,
 Shawn
 



Re: Profiling Solr Lucene for query

2013-09-09 Thread Dmitry Kan
Hi Manuel,

The frontend solr instance is the one that does not have its own index and
is doing merging of the results. Is this the case? If yes, are all 36
shards always queried?

Dmitry


On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand 
manuel.lenorm...@gmail.com wrote:

 Hi Dmitry,

 I have solr 4.3 and every query is distributed and merged back for ranking
 purpose.

 What do you mean by frontend solr?


 On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote:

  are you querying your shards via a frontend solr? We have noticed, that
  querying becomes much faster if results merging can be avoided.
 
  Dmitry
 
 
  On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand 
  manuel.lenorm...@gmail.com wrote:
 
   Hello all
   Looking on the 10% slowest queries, I get very bad performances (~60
 sec
   per query).
   These queries have lots of conditions on my main field (more than a
   hundred), including phrase queries and rows=1000. I do return only id's
   though.
   I can quite firmly say that this bad performance is due to slow storage
   issue (that are beyond my control for now). Despite this I want to
  improve
   my performances.
  
   As tought in school, I started profiling these queries and the data of
 ~1
   minute profile is located here:
   http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
  
   Main observation: most of the time I do wait for readVInt, who's
  stacktrace
   (2 out of 2 thread dumps) is:
  
   catalina-exec-3870 - Thread t@6615
java.lang.Thread.State: RUNNABLE
at org.apadhe.lucene.store.DataInput.readVInt(DataInput.java:108)
at
  
  
 
 org.apaChe.lucene.codeosAockTreeIermsReade$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:
   2357)
at
  
  
 
 ora.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
at org.apadhe.lucene.index.TermContext.build(TermContext.java:95)
at
  
  
 
 org.apache.lucene.search.PhraseQuery$PhraseWeight.init(PhraseQuery.java:221)
at
  org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
at
  
  
 
 org.apache.lucene.search.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183)
at
  
 org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at
  
  
 
 org.apache.lucene.searth.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183)
at
  
 oro.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at
  
  
 
 org.apache.lucene.searth.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183)
at
  
 org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at
  
  
 
 org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
  
  
   So I do actually wait for IO as expected, but I might be too many time
  page
   faulting while looking for the TermBlocks (tim file), ie locating the
  term.
   As I reindex now, would it be useful lowering down the termInterval
   (default to 128)? As the FST (tip files) are that small (few 10-100 MB)
  so
   there are no memory contentions, could I lower down this param to 8 for
   example? The benefit from lowering down the term interval would be to
   obligate the FST to get on memory (JVM - thanks to the
  NRTCachingDirectory)
   as I do not control the term dictionary file (OS caching, loads an
  average
   of 6% of it).
  
  
   General configs:
   solr 4.3
   36 shards, each has few million docs
   These 36 servers (each server has 2 replicas) are running virtual, 16GB
   memory each (4GB for JVM, 12GB remain for the OS caching),  consuming
  260GB
   of disk mounted for the index files.
  
 



Re: charfilter doesn't do anything

2013-09-09 Thread Andreas Owen
I tried, but that isn't working either; it wants a data stream. I'll have to 
check how to post JSON instead of XML.

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

 Did you at least try the pattern I gave you?
 
 The point of the curl was the data, not how you send the data. You can just 
 use the standard Solr simple post tool.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Monday, September 09, 2013 6:40 PM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 i've downloaded curl and tried it in the comman prompt and power shell on my 
 win 2008r2 server, thats why i used my dataimporter with a single line html 
 file and copy/pastet the lines into schema.xml
 
 
 On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
 
 Did you in fact try my suggested example? If not, please do so.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Monday, September 09, 2013 4:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 i index html pages with a lot of lines and not just a string with the 
 body-tag.
 it doesn't work with proper html files, even though i took all the new lines 
 out.
 
 html-file:
 htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html
 
 solr update debug output:
 text_html: [html\r\n\r\nmeta name=\Content-Encoding\ 
 content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; 
 charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das 
 will ich sehenfooter-content/body/html]
 
 
 
 On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
 
 I tried this and it seems to work when added to the standard Solr example 
 in 4.4:
 
 field name=body type=text_html_body indexed=true stored=true /
 
 fieldType name=text_html_body class=solr.TextField 
 positionIncrementGap=100 
 analyzer
 charFilter class=solr.PatternReplaceCharFilterFactory 
 pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 /
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory/
 /analyzer
 /fieldType
 
 That char filter retains only text between body and /body. Is that what 
 you wanted?
 
 Indexing this data:
 
 curl 'localhost:8983/solr/update?commit=true' -H 
 'Content-type:application/json' -d '
 [{id:doc-1,body:abc bodyA test./body def}]'
 
 And querying with these commands:
 
 curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json;
 Shows all data
 
 curl http://localhost:8983/solr/select/?q=body:testindent=truewt=json;
 shows the body text
 
 curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json;
 shows nothing (outside of body)
 
 curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json;
 shows nothing (outside of body)
 
 curl http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json;
 Shows nothing, HTML tag stripped
 
 In your original query, you didn't show us what your default field, df 
 parameter, was.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Sunday, September 08, 2013 5:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 yes but that filter html and not the specific tag i want.
 
 On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
 
 Hmmm, have you looked at:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Not quite the body, perhaps, but might it help?
 
 
 On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:
 
 ok i have html pages with html.!--body--content i
 want!--/body--./html. i want to extract (index, store) only
 that between the body-comments. i thought regexTransformer would be the
 best because xpath doesn't work in tika and i cant nest a
 xpathEntetyProcessor to use xpath. what i have also found out is that the
 htmlparser from tika cuts my body-comments out and tries to make well
 formed html, which i would like to switch off.
 
 On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
 
 On 9/6/2013 7:09 AM, Andreas Owen wrote:
 i've managed to get it working if i use the regexTransformer and string
 is on the same line in my tika entity. but when the string is multilined 
 it
 isn't working even though i tried ?s to set the flag dotall.
 
 entity name=tika processor=TikaEntityProcessor url=${rec.url}
 dataSource=dataUrl onError=skip htmlMapper=identity format=html
 transformer=RegexTransformer
  field column=text_html regex=lt;bodygt;(.+)lt;/bodygt;
 replaceWith=QQQ sourceColName=text  /
 /entity
 
 then i tried it like this and i get a stackoverflow
 
 field column=text_html regex=lt;bodygt;((.|\n|\r)+)lt;/bodygt;
 replaceWith=QQQ sourceColName=text  /
 
 in javascript this works but maybe because i only used a small string.
 
 Sounds like we've got an XY problem here.
 
 http://people.apache.org/~hossman/#xyproblem
 
 How about you tell us *exactly* what you'd actually like to have happen
 and then we can find a 

Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-09 Thread Chris Hostetter
: Subject: Re: unknown _stream_source_info while indexing rich doc in solr
: 
: Error  got resolved,thanks a lot  Sir.I have been trying since days to
:  resolve it.

Users shouldn't have to worry about problems like this ... I'll try to 
make this less error-prone...

https://issues.apache.org/jira/browse/SOLR-5228


-Hoss


Re: charfilter doesn't do anything

2013-09-09 Thread Andreas Owen
I've downloaded curl and tried it in the command prompt and PowerShell on my 
Win 2008 R2 server; that's why I used my dataimporter with a single-line HTML 
file and copy/pasted the lines into schema.xml.


On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

 Did you in fact try my suggested example? If not, please do so.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Monday, September 09, 2013 4:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 i index html pages with a lot of lines and not just a string with the 
 body-tag.
 it doesn't work with proper html files, even though i took all the new lines 
 out.
 
 html-file:
 htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html
 
 solr update debug output:
 text_html: [html\r\n\r\nmeta name=\Content-Encoding\ 
 content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; 
 charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das 
 will ich sehenfooter-content/body/html]
 
 
 
 On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
 
 I tried this and it seems to work when added to the standard Solr example in 
 4.4:
 
 field name=body type=text_html_body indexed=true stored=true /
 
 fieldType name=text_html_body class=solr.TextField 
 positionIncrementGap=100 
 analyzer
  charFilter class=solr.PatternReplaceCharFilterFactory 
 pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 /
  tokenizer class=solr.StandardTokenizerFactory/
  filter class=solr.LowerCaseFilterFactory/
 /analyzer
 /fieldType
 
 That char filter retains only text between body and /body. Is that what 
 you wanted?
 
 Indexing this data:
 
 curl 'localhost:8983/solr/update?commit=true' -H 
 'Content-type:application/json' -d '
 [{id:doc-1,body:abc bodyA test./body def}]'
 
 And querying with these commands:
 
 curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json;
 Shows all data
 
 curl http://localhost:8983/solr/select/?q=body:testindent=truewt=json;
 shows the body text
 
 curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json;
 shows nothing (outside of body)
 
 curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json;
 shows nothing (outside of body)
 
 curl http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json;
 Shows nothing, HTML tag stripped
 
 In your original query, you didn't show us what your default field, df 
 parameter, was.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Sunday, September 08, 2013 5:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 yes but that filter html and not the specific tag i want.
 
 On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
 
 Hmmm, have you looked at:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Not quite the body, perhaps, but might it help?
 
 
 On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:
 
 ok i have html pages with html.!--body--content i
 want!--/body--./html. i want to extract (index, store) only
 that between the body-comments. i thought regexTransformer would be the
 best because xpath doesn't work in tika and i cant nest a
 xpathEntetyProcessor to use xpath. what i have also found out is that the
 htmlparser from tika cuts my body-comments out and tries to make well
 formed html, which i would like to switch off.
 
 On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
 
 On 9/6/2013 7:09 AM, Andreas Owen wrote:
 i've managed to get it working if i use the regexTransformer and string
 is on the same line in my tika entity. but when the string is multilined it
 isn't working even though i tried ?s to set the flag dotall.
 
 entity name=tika processor=TikaEntityProcessor url=${rec.url}
 dataSource=dataUrl onError=skip htmlMapper=identity format=html
 transformer=RegexTransformer
   field column=text_html regex=lt;bodygt;(.+)lt;/bodygt;
 replaceWith=QQQ sourceColName=text  /
 /entity
 
 then i tried it like this and i get a stackoverflow
 
 field column=text_html regex=lt;bodygt;((.|\n|\r)+)lt;/bodygt;
 replaceWith=QQQ sourceColName=text  /
 
 in javascript this works but maybe because i only used a small string.
 
 Sounds like we've got an XY problem here.
 
 http://people.apache.org/~hossman/#xyproblem
 
 How about you tell us *exactly* what you'd actually like to have happen
 and then we can find a solution for you?
 
 It sounds a little bit like you're interested in stripping all the HTML
 tags out.  Perhaps the HTMLStripCharFilter?
 
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
 Something that I already said: By using the KeywordTokenizer, you won't
 be able to search for individual words on your HTML input.  The entire
 input string is treated as a single token, and therefore ONLY exact
 entire-field matches (or certain wildcard matches) will be possible.
 
 
 

Re: Facet sort descending

2013-09-09 Thread Chris Hostetter

: Is there a plan to add a descending sort order for facet queries ?
: Best regards Sandro

I don't understand your question.

If you specify multiple facet.query params, then the constraint counts are 
returned in the order they were initially specified -- there is no need 
for server-side sorting, because they all come back (as opposed to 
facet.field, where the number of constraints can be unbounded and you may 
request just the top X using facet.limit).
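
For illustration, a minimal sketch (the price field and the ranges are made up):

facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]

The two counts come back under facet_counts/facet_queries in exactly that order.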

If you are asking about facet.field and using facet.sort to specify the 
order of the constraints for each field, then no -- i don't believe anyone 
is currently working on adding options for descending sort.

I don't think it would be hard to add if someone wanted to ... I just 
don't know that there has ever been enough demand for anyone to look into 
it.


-Hoss


Re: Solr suggest - How to define solr suggest as case insensitive

2013-09-09 Thread Chris Hostetter

: This is probably because your dictionary is made up of all lower case tokens,
: but when you query the spell-checker similar analysis doesnt happen. Ideal
: case would be when you query the spellchecker you send lower case queries

You can init the SpellCheckComponent with a queryAnalyzerFieldType 
option that will control what analysis happens.  ie...

  <!-- This field type's analyzer is used by the QueryConverter to tokenize the value for the q parameter -->
  <str name="queryAnalyzerFieldType">phrase_suggest</str>


...it would be nice if this defaulted to using the fieldType of the field 
you configure on the Suggester, but not all impls are based on the index 
(you might be using an external dict file), so it has to be explicitly 
configured, and it defaults to using a simple WhitespaceAnalyzer.
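
A minimal sketch of what such a field type could look like (this analyzer
chain is an assumption, not taken from your schema -- the point is only that
it lowercases the query):

<fieldType name="phrase_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With that in place, the value of q is lowercased before it is handed to the
suggester, matching the lowercase tokens in the dictionary.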


-Hoss


Re: charfilter doesn't do anything

2013-09-09 Thread Jack Krupansky

Did you at least try the pattern I gave you?

The point of the curl was the data, not how you send the data. You can just 
use the standard Solr simple post tool.


-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Monday, September 09, 2013 6:40 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the comman prompt and power shell on my 
win 2008r2 server, thats why i used my dataimporter with a single line html 
file and copy/pastet the lines into schema.xml



On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:


Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-Original Message- From: Andreas Owen
Sent: Monday, September 09, 2013 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines and not just a string with the 
body-tag.
it doesn't work with proper html files, even though i took all the new 
lines out.


html-file:
htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html

solr update debug output:
text_html: [html\r\n\r\nmeta name=\Content-Encoding\ 
content=\ISO-8859-1\\r\nmeta name=\Content-Type\ 
content=\text/html; 
charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das 
will ich sehenfooter-content/body/html]




On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

I tried this and it seems to work when added to the standard Solr example 
in 4.4:


field name=body type=text_html_body indexed=true stored=true /

fieldType name=text_html_body class=solr.TextField 
positionIncrementGap=100 

analyzer
 charFilter class=solr.PatternReplaceCharFilterFactory 
pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 /

 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory/
/analyzer
/fieldType

That char filter retains only text between body and /body. Is that 
what you wanted?


Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 
'Content-type:application/json' -d '

[{id:doc-1,body:abc bodyA test./body def}]'

And querying with these commands:

curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json;
Shows all data

curl http://localhost:8983/solr/select/?q=body:testindent=truewt=json;
shows the body text

curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json;
shows nothing (outside of body)

curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json;
shows nothing (outside of body)

curl http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json;
Shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, df 
parameter, was.


-- Jack Krupansky

-Original Message- From: Andreas Owen
Sent: Sunday, September 08, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

yes but that filter html and not the specific tag i want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:


Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Not quite the body, perhaps, but might it help?


On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:


ok i have html pages with html.!--body--content i
want!--/body--./html. i want to extract (index, store) only
that between the body-comments. i thought regexTransformer would be the
best because xpath doesn't work in tika and i cant nest a
xpathEntetyProcessor to use xpath. what i have also found out is that 
the

htmlparser from tika cuts my body-comments out and tries to make well
formed html, which i would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:


On 9/6/2013 7:09 AM, Andreas Owen wrote:
i've managed to get it working if i use the regexTransformer and 
string
is on the same line in my tika entity. but when the string is 
multilined it

isn't working even though i tried ?s to set the flag dotall.


entity name=tika processor=TikaEntityProcessor url=${rec.url}

dataSource=dataUrl onError=skip htmlMapper=identity format=html
transformer=RegexTransformer

  field column=text_html regex=lt;bodygt;(.+)lt;/bodygt;

replaceWith=QQQ sourceColName=text  /

/entity

then i tried it like this and i get a stackoverflow

field column=text_html 
regex=lt;bodygt;((.|\n|\r)+)lt;/bodygt;

replaceWith=QQQ sourceColName=text  /


in javascript this works but maybe because i only used a small 
string.


Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have 
happen

and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the 
HTML

tags out.  Perhaps the HTMLStripCharFilter?



http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory


Something that I already said: By using the 

Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-09 Thread Yago Riveiro
If you have 15K collections, I guess that you are doing custom sharding and not 
using collection sharding.

My first approach was the same as yours. In fact, I have the same lots-of-cores 
issue. I use -Djute.maxbuffer without any problem.

In recent versions, Solr implements a way to do sharding using a prefix in your 
ID, so I replaced my lot of cores with one collection with shards. Now, with 
the splitshard feature, you can split the shards that reach a considerable size.

Downside: I don't know if the splitshard feature honors the compositeId defined 
at collection creation.

Recommendation: if you don't want the lots-of-cores issue to bite you with some 
kind of weird or anomalous behavior, try to reduce the number of cores as much 
as possible and split shards as necessary when performance starts to hurt.
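
For reference, a shard split is issued through the Collections API; a minimal
sketch (collection and shard names are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1'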

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, September 9, 2013 at 3:09 PM, diyun2008 wrote:

 I just found this option -Djute.maxbuffer in zookeeper admin document. But
 it's a Unsafe Options. I can't really know what it mean. Maybe that will
 bring some unstable problems? Does someone have some real practical
 experiences when using this parameter? I will have at least 15K collections.
 Or I will have to merge them to small numbers.
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088878.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: charfilter doesn't do anything

2013-09-09 Thread Jack Krupansky
Use XML then. Although you will need to escape the XML special characters as 
I did in the pattern.


The point is simply: Quickly and simply try to find the simple test scenario 
that illustrates the problem.


-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Monday, September 09, 2013 7:05 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

i tried but that isn't working either, it want a data-stream, i'll have to 
check how to post json instead of xml


On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:


Did you at least try the pattern I gave you?

The point of the curl was the data, not how you send the data. You can 
just use the standard Solr simple post tool.


-- Jack Krupansky

-Original Message- From: Andreas Owen
Sent: Monday, September 09, 2013 6:40 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the comman prompt and power shell on 
my win 2008r2 server, thats why i used my dataimporter with a single line 
html file and copy/pastet the lines into schema.xml



On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:


Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-Original Message- From: Andreas Owen
Sent: Monday, September 09, 2013 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines and not just a string with the 
body-tag.
it doesn't work with proper html files, even though i took all the new 
lines out.


html-file:
htmlnav-contentbody nur das will ich 
sehen/bodyfooter-content/html


solr update debug output:
text_html: [html\r\n\r\nmeta name=\Content-Encoding\ 
content=\ISO-8859-1\\r\nmeta name=\Content-Type\ 
content=\text/html; 
charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das 
will ich sehenfooter-content/body/html]




On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

I tried this and it seems to work when added to the standard Solr 
example in 4.4:


field name=body type=text_html_body indexed=true stored=true /

fieldType name=text_html_body class=solr.TextField 
positionIncrementGap=100 

analyzer
charFilter class=solr.PatternReplaceCharFilterFactory 
pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 /

tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/
/analyzer
/fieldType

That char filter retains only text between body and /body. Is that 
what you wanted?


Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 
'Content-type:application/json' -d '

[{id:doc-1,body:abc bodyA test./body def}]'

And querying with these commands:

curl http://localhost:8983/solr/select/?q=*:*indent=truewt=json;
Shows all data

curl 
http://localhost:8983/solr/select/?q=body:testindent=truewt=json;

shows the body text

curl http://localhost:8983/solr/select/?q=body:abcindent=truewt=json;
shows nothing (outside of body)

curl http://localhost:8983/solr/select/?q=body:defindent=truewt=json;
shows nothing (outside of body)

curl 
http://localhost:8983/solr/select/?q=body:bodyindent=truewt=json;

Shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, df 
parameter, was.


-- Jack Krupansky

-Original Message- From: Andreas Owen
Sent: Sunday, September 08, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

yes but that filter html and not the specific tag i want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:


Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Not quite the body, perhaps, but might it help?


On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:


ok i have html pages with html.!--body--content i
want!--/body--./html. i want to extract (index, store) 
only
that between the body-comments. i thought regexTransformer would be 
the

best because xpath doesn't work in tika and i cant nest a
xpathEntetyProcessor to use xpath. what i have also found out is that 
the

htmlparser from tika cuts my body-comments out and tries to make well
formed html, which i would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:


On 9/6/2013 7:09 AM, Andreas Owen wrote:
i've managed to get it working if i use the regexTransformer and 
string
is on the same line in my tika entity. but when the string is 
multilined it

isn't working even though i tried ?s to set the flag dotall.


entity name=tika processor=TikaEntityProcessor url=${rec.url}
dataSource=dataUrl onError=skip htmlMapper=identity 
format=html

transformer=RegexTransformer

 field column=text_html regex=lt;bodygt;(.+)lt;/bodygt;

replaceWith=QQQ sourceColName=text  /

/entity

then i tried it like this and i get a stackoverflow

field column=text_html 
regex=lt;bodygt;((.|\n|\r)+)lt;/bodygt;


Re: Expunge deleting using excessive transient disk space

2013-09-09 Thread Chris Hostetter

:  Looking on the infostream I can see that the first merges do succeed but
: older segments are kept in reference thus cannot be deleted until all the
: merging are done.

I suspect what you are seeing is that file handles for the older segments 
are kept open (and thus the bytes on disk for those old segments are not 
freed up for use by new segments) because the existing IndexReaders still 
need to use them until the merge process completes, a new 
IndexReader/IndexSearcher is opened and warmed, *and* all executing 
requests that used the old IndexSearcher have completed.


-Hoss


Re: Data import

2013-09-09 Thread Chris Hostetter

: Any form of indexing would always replace a document and never update it.

At a very low level this is true, but Solr does support Atomic Updates 
(aka Partial Updates), which allow a client to specify only the values of an 
existing document they want to change; Solr handles everything on the 
server side.
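
For example, a minimal sketch of an atomic update sent to the JSON update
handler (the id and field name here are made up):

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
[{"id":"doc-1","title":{"set":"a new title"}}]'

Only the title field is changed; the other stored fields of the document are
carried over on the server side.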

: But i still dont get one thing, if i have two indexes that i try to merge
: and both the indexes have some documents with same unique ids, they dont
: overwrite each other. Instead what i have is two documents with same unique
: id. Why does this happen? Anyone any clues?

This seems like a completely unrelated question -- please start a new 
thread and provide full details of your situation and question in order 
for people to try to assist you...


https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss


Re: Data import

2013-09-09 Thread Chris Hostetter

: When i run dataimport/?command=full-importclean=false, solr add new 
: documents with the information. But if the same information already 
: exists with the same uniquekey, it replaces the existing document with a 
: new one.
: It does not update the document, it creates a new one. It's that possible?

I'm not certain that I'm understanding your question.

It is possible using Atomic Updates, but you have to be explicit 
about what/how you want Solr to use the new information (ie: when to 
replace, when to add to a multivalued field, when to increment a numeric 
field, etc...)

https://wiki.apache.org/solr/Atomic_Updates

I don't think DIH has any straightforward syntax for letting you 
configure this easily, but as long as you put a map in each 
field (ie: via a ScriptTransformer perhaps) containing the single modifier = 
value pair you want applied to that field, it should work.

: I'm indexing rss feeds. I run the rss example that exists in the solr 
: examples, and i does that.

Can you please be more specific about what you would like to see happen, 
so we can better understand what your actual goal is?  It's really not clear 
if using Atomic Updates is the easiest way to achieve what you're after, 
or if I'm just completely misunderstanding your question...

https://wiki.apache.org/solr/UsingMailingLists

-Hoss


Re: Data import

2013-09-09 Thread Luis Portela Afonso
So I'm indexing RSS feeds.
I'm running the data import full-import command with a cron job. It runs
every 15 minutes and indexes a lot of RSS feeds from many sources.

With the cron job, I do an HTTP request using curl to the address
http://localhost:port/solr/core/dataimport/?command=full-import&clean=false

When it runs, if the RSS source has a feed that is already indexed in Solr,
it updates the existing document.
So if the source has the same information as the destination, it updates the
information at the destination.

I want to prevent that. Is that explicit enough? I can try to provide some
examples.

Thanks

On Tuesday, September 10, 2013, Chris Hostetter wrote:


 : When i run dataimport/?command=full-importclean=false, solr add new
 : documents with the information. But if the same information already
 : exists with the same uniquekey, it replaces the existing document with a
 : new one.
 : It does not update the document, it creates a new one. It's that
 possible?

 I'm not certain that i'm understanding your question.

 It is possible using Atomic Updates, but you have to be explicit
 about what/how you wnat Solr to use the new information (ie: when to
 replace, when to add to a multivaluded field, when to increment a numeric
 field, etc...)

 https://wiki.apache.org/solr/Atomic_Updates

 I don't think DIH has any straight forward syntax for letting you
 configure this easily, but as long as you put a map in each
 field (ie: via ScriptTransformer perhaps) containing a single modifier =
 value pair you want applied to that field, it should work.

 : I'm indexing rss feeds. I run the rss example that exists in the solr
 : examples, and i does that.

 Can you please be more specific about what you would like to see happen,
 we can better understand what your actual goal is?  It's really not clear
 if using Atomic Updates is the easiest way to achieve what you're after,
 or if I'm just completley missunderstanding your question...

 https://wiki.apache.org/solr/UsingMailingLists

 -Hoss



-- 
Sent from Gmail Mobile


Re: Data import

2013-09-09 Thread Chris Hostetter

: With cron job, I do a http request using curl, to the address
: http://localhost:port/solr/core/dataimport/?command=full-importclean=false
: 
: When it runs, if the rss source has a feed that is already indexed on solr,
: it updates the existing source.
: So if the source has the same information of the destiny, it updates the
: information on the destiny.
: 
: I want to prevent that. Is that explicit? I may try to provide some
: examples.

Yes, specific examples would be helpful -- it's not really clear what it 
is that you want to prevent.

Please note the URL i mentioned before and use it as a guideline for 
how much detail we need to understand what it is you are asking...

:  Can you please be more specific about what you would like to see happen,
:  we can better understand what your actual goal is?  It's really not clear

:  https://wiki.apache.org/solr/UsingMailingLists



-Hoss


Re: Data import

2013-09-09 Thread Luis Portela Afonso
But with atomic updates I need to send the information, right?

I want Solr to index it automatically. And it is doing that. Can you look
at the Solr example in the source?
There is an example in the example-DIH folder.

Imagine that you run the URL to import the data every 15 minutes. If the
same information is already indexed, Solr will update it, and by update I
mean delete and index again.

I just want Solr to simply discard the information if it is already
indexed.

On Tuesday, September 10, 2013, Chris Hostetter wrote:


 : With cron job, I do a http request using curl, to the address
 : http://localhost:port
 /solr/core/dataimport/?command=full-importclean=false
 :
 : When it runs, if the rss source has a feed that is already indexed on
 solr,
 : it updates the existing source.
 : So if the source has the same information of the destiny, it updates the
 : information on the destiny.
 :
 : I want to prevent that. Is that explicit? I may try to provide some
 : examples.

 Yes, specific examples would be helpful -- it's not really clear what it
 is that you want to prevent.

 Please note the URL i mentioned before and use it as a guideline for
 how much detail we need to understand what it is you are asking...

 :  Can you please be more specific about what you would like to see
 happen,
 :  we can better understand what your actual goal is?  It's really not
 clear

 :  https://wiki.apache.org/solr/UsingMailingLists



 -Hoss



-- 
Sent from Gmail Mobile


Re: Data import

2013-09-09 Thread Alexandre Rafalovitch
Sounds like you want a custom UpdateRequestProcessor chain that checks if
the document already exists with given primary key and does not even bother
passing it on to the next processor in the chain.

This would make sense as an optimization or as a first step in a complex
update chain that perhaps uses a lot of external resources to pre-process
the content (e.g. named entities extraction).

I don't think such a URP exists at the moment? But it should be simple to
write one, assuming URPs can do lookups by primary ID and make go/no-go
decisions on individual documents. Does anybody know the details of this?
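
If such a processor existed, wiring it into solrconfig.xml would look roughly
like this (the first class name is hypothetical -- it is the piece that would
have to be written):

<updateRequestProcessorChain name="skip-existing">
  <processor class="com.example.SkipExistingDocumentsProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>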

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Sep 10, 2013 at 7:53 AM, Luis Portela Afonso meligalet...@gmail.com
 wrote:

 But with atomic updates i need to send the information, right?

 I want that solr automatic indexes it. And he is doing that. Can you look
 at the solr example in the source?
 There is an example on example-DIH folder.

 Imagine that you run the URL to import the data every 15 minutes. If the
 same information is already indexed, solr will update it, and by update I
 mean delete and index again.

 I just want that solr simple discards the information if this already
 exists with indexed.

 On Tuesday, September 10, 2013, Chris Hostetter wrote:

 
  : With cron job, I do a http request using curl, to the address
  : http://localhost:port
  /solr/core/dataimport/?command=full-importclean=false
  :
  : When it runs, if the rss source has a feed that is already indexed on
  solr,
  : it updates the existing source.
  : So if the source has the same information of the destiny, it updates
 the
  : information on the destiny.
  :
  : I want to prevent that. Is that explicit? I may try to provide some
  : examples.
 
  Yes, specific examples would be helpful -- it's not really clear what it
  is that you want to prevent.
 
  Please note the URL i mentioned before and use it as a guideline for
  how much detail we need to understand what it is you are asking...
 
  :  Can you please be more specific about what you would like to see
  happen,
  :  we can better understand what your actual goal is?  It's really not
  clear
 
  :  https://wiki.apache.org/solr/UsingMailingLists
 
 
 
  -Hoss
 


 --
 Sent from Gmail Mobile



find all two word phrases that appear in more than one document

2013-09-09 Thread Ali, Saqib
Dear Solr Ninjas,

We would like to run a query that returns two-word phrases that appear in
more than one document. For example, take the string "Solr Ninja": since it
appears in more than one document in our Solr instance, the query should
return it. The query should find all such phrases across all the documents
in our Solr instance, i.e. every combination of two adjacent words
(forming a phrase) that occurs in the documents in the Solr index.

Any ideas on how to write this query?

Thanks.


Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?

2013-09-09 Thread diyun2008
Thank you very much for your advice.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4089009.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: find all two word phrases that appear in more than one document

2013-09-09 Thread Alexandre Rafalovitch
The phrases are usually called n-grams or shingles.

You can probably use ShingleFilterFactory to create your shingles (possibly
with outputUnigrams=false) and then use TermsComponent (
http://wiki.apache.org/solr/TermsComponent) to list the results.
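
A minimal sketch of the idea (field and type names are made up, and the /terms
handler is assumed to be enabled in solrconfig.xml):

<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>

Then, with a field of that type (here called shingled_text) populated via copyField:

curl 'http://localhost:8983/solr/terms?terms.fl=shingled_text&terms.mincount=2&terms.limit=100'

terms.mincount=2 keeps only shingles whose document frequency is at least 2,
i.e. two-word phrases that occur in more than one document.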

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Sep 10, 2013 at 8:22 AM, Ali, Saqib docbook@gmail.com wrote:

 Dear Solr Ninjas,

 We would like to run a query that returns two word phrases that appear in
 more than one document. So for e.g. take the string Solr Ninja. Since it
 appears in more than one document in our Solr instance, the query should
 return that. The query should  find all such phrases from all the documents
 in our Solr instance, by querying for two adjacent word combination
 (forming a phrase) in the documents that are in the Solr. These two
 adjacent word combinations should come from the documents in the Solr
 index.

 Any ideas on how to write this query?

 Thanks.



Re: Restrict Parsing duplicate file in Solr

2013-09-09 Thread shabbir
Thanks for the response. My requirement is to make sure I detect when a file
is already indexed and ignore it, instead of replacing the existing one.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Restrict-Parsing-duplicate-file-in-Solr-tp4088471p4089023.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: find all two word phrases that appear in more than one document

2013-09-09 Thread Ali, Saqib
Thanks Alexandre. I looked at the wiki page for the TermsComponent. But I
am not sure if I follow. Do you have an example or some better document?
Thanks! :)


On Mon, Sep 9, 2013 at 8:17 PM, Alexandre Rafalovitch arafa...@gmail.comwrote:

 The phases are usually called n-grams or shingles.

 You can probably use ShingleFilterFactory to create your shingles (possibly
 with outputUnigrams=false) and then use TermsComponent (
 http://wiki.apache.org/solr/TermsComponent) to list the results.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Tue, Sep 10, 2013 at 8:22 AM, Ali, Saqib docbook@gmail.com wrote:

  Dear Solr Ninjas,
 
  We would like to run a query that returns two word phrases that appear in
  more than one document. So for e.g. take the string Solr Ninja. Since
 it
  appears in more than one document in our Solr instance, the query should
  return that. The query should  find all such phrases from all the
 documents
  in our Solr instance, by querying for two adjacent word combination
  (forming a phrase) in the documents that are in the Solr. These two
  adjacent word combinations should come from the documents in the Solr
  index.
 
  Any ideas on how to write this query?
 
  Thanks.
 



Does a configuration change require a Zookeeper restart?

2013-09-09 Thread Prasi S
Hi,
I have SolrCloud with two collections. I have indexed 100 million docs into
the first collection.

I need some changes to the Solr configuration files. I'm going to index the
new data into the second collection. What are the steps that I should follow?
Should I restart ZooKeeper?

Pls suggest


Thanks,
Prasi