Re: correlation between score and term frequency

2007-10-02 Thread Alexander Kubias
Yes, that was the meaning of my question! Can you answer it?

-----Original Message-----
From: Joseph Doehr [mailto:[EMAIL PROTECTED] 
Sent: Monday, 1 October 2007 20:00
To: solr-user@lucene.apache.org
Subject: Re: correlation between score and term frequency



Hi Alex,

do you mean you would like to know whether both results have the same
relevance across the whole indexed content, and whether the two results
are directly comparable?


[EMAIL PROTECTED] wrote:
 I have a question about the correlation between the score value and 
 the term frequency. Let's assume that we have one index over one set 
 of documents. In addition to that, let's assume that there is only one 
 term in a query.
 
 If we now search for the term "car" and get a certain score value X, 
 and if we then search for the term "football" and get the same score 
 value X, is it then certain that both values X mean the same?
 
 Could you explain what correlation exists between the score value and 
 the term frequency in my scenario?



unable to figure out nutch type highlighting in solr....

2007-10-02 Thread Ravish Bhagdev
I have tried very hard to follow the documentation and forum threads that
try to answer questions about how to return snippets with highlights for
the relevant searched term using Solr (as Nutch does with such ease).

I would be really grateful if someone could guide me through the basics. I
have made sure that the field to be highlighted is stored in the index, etc.
Still I can't figure out why it doesn't return the snippet and instead
returns the whole document.

I have tried all the different highlight parameters with variations, but I
have no idea what's happening.  Can I test highlighting with the bundled
example application using the full search interface option?  At the moment
it just returns xml with the full document between the field tags.
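
For what it's worth, a minimal highlighting request against the example
setup usually looks something like this (assuming the field to highlight
is named "text" and is stored):

  http://localhost:8983/solr/select?q=text:foo&hl=true&hl.fl=text&hl.snippets=2&hl.fragsize=100

Note that the snippets come back in a separate <lst name="highlighting">
section of the response; the full document between the field tags is just
the stored field, which is returned regardless of highlighting.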

Please find attached my conf files as well
<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<config>
  <!-- Set this to 'false' if you want solr to continue working after it has 
   encountered a severe configuration error.  In a production environment, 
   you may want solr to keep working even if one handler is mis-configured.

   You may also set this to false by setting the system property:
     -Dsolr.abortOnConfigurationError=false
   -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <!-- Used to specify an alternate directory to hold all index data
   other than the default ./data under the Solr home.
   If replication is in use, this should match the replication configuration. -->
  <!--
  <dataDir>./solr/data</dataDir>
  -->

  <indexDefaults>
   <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>100</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>

    <!-- If true, unlock any held write or commit locks on startup. 
     This defeats the locking mechanism that allows multiple
     processes to safely access a lucene index, and should be
     used with care. -->
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <!-- the default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">

    <!-- A prefix of "solr." for class names is an alias that
     causes solr to search appropriate packages, including
     org.apache.solr.(search|update|request|core|analysis)
     -->

    <!-- autocommit pending docs if certain criteria are met 
    <autoCommit> 
      <maxDocs>1</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
    <autoCommit> 
      <maxDocs>1000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>

    <!-- The RunExecutableListener executes an external command.
     exe - the name of the executable to run
     dir - dir to use as the current working directory. default="."
     wait - the calling thread waits until the executable returns. default="true"
     args - the arguments to pass to the program.  default=nothing
     env - environment variables to set.  default=nothing
      -->
    <!-- A postCommit event is fired after every commit or optimize command
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>
    -->
    <!-- A postOptimize event is fired only after every optimize command, useful
     in conjunction with index distribution to only distribute optimized indices 
    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>
    -->

  </updateHandler>


  <query>
    <!-- Maximum number of clauses in a boolean query... can affect
    range or prefix queries that expand 

Re: Searching combined English-Japanese index

2007-10-02 Thread Maximilian Hütter
Yonik Seeley wrote:
 On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 Yonik Seeley wrote:
 On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 When I search using an English term, I get results but the Japanese is
 not encoded correctly in the response. (although it is UTF-8 encoded)
 One quick thing to try is the python writer (wt=python) to see the
 actual unicode values of what you are getting back (since the python
 writer automatically escapes non-ascii).  That can help rule out
 incorrect charset handling by clients.

 -Yonik

 Thanks for the tip, it turns out that the unicode values are wrong... I
 mean the browser displays correctly what is sent. But I don't know how
 solr gets these values.
 
 OK, so they never got into the index correctly.
 The most likely explanation is that the charset wasn't set correctly
 when the update message was sent to Solr.
 
 -Yonik
 
Are you sure they are wrong in the index? When I use the Lucene Index
Monitor (http://limo.sourceforge.net/) to look at the document in the
index, the Japanese is displayed correctly.
I am using Jetty 6.0.1 by the way.

Best regards,

Max

-- 
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart

Tel:  (+49) 0711 - 45 10 17 578
Fax:  (+49) 0711 - 45 10 17 573
e-mail :  [EMAIL PROTECTED]
Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich


Re: Index multiple languages with multiple analyzers with the same field

2007-10-02 Thread Daniel Alheiros
Same Here.

But I can't see how to fit into this scheme UNLESS you create an analyzer
that takes a language parameter and, based on it, applies a set of filters
(and sometimes you want a different - but compatible - set of filters at
indexing and query time). It would work, but doing so we lose the advantage
of having the Solr config where we can change and experiment with
alternative analyzer/tokenizer/filter compositions...

What I've done is create one specific text field per language and one
dismax request handler per language (using the language name or ISO code),
and it is very flexible and appropriate for each language.

For management simplicity I've also created a dismax handler that allows me
to query all documents no matter which language they are in. It may be
useful for you too.
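
As a rough sketch (the field and handler names below are illustrative,
not my actual config), the per-language setup looks something like this:

  <!-- schema.xml: one text field per language -->
  <field name="text_en" type="text_en" indexed="true" stored="true"/>
  <field name="text_ja" type="text_ja" indexed="true" stored="true"/>

  <!-- solrconfig.xml: one dismax handler per language -->
  <requestHandler name="dismax_en" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="qf">text_en</str>
    </lst>
  </requestHandler>

  <!-- the "all languages" handler simply lists every field in qf -->
  <requestHandler name="dismax_all" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="qf">text_en text_ja</str>
    </lst>
  </requestHandler>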

Regards,
Daniel Alheiros




On 29/9/07 03:29, Lance Norskog [EMAIL PROTECTED] wrote:

 Other people custom-create a separate dynamic field for each language they
 want to support.  The spellchecker in Solr 1.2 wants just one field to use
 as its word source, so this fits.
 
 We have a more complex version of this problem: we have content with both
 English and other languages. Searching is one problem; we also want to have
 spelling correction dictionaries for each language. We have many world
 languages which need very different handling and semantics, like CJK
 processing. We will have to use the multiple-field trick; I don't think we
 can shoehorn our complexity into this technique. It is a valiant effort,
 though.
 
 It's possible we could separate out the different-language words in the
 document, put them each in separate words_en_text, word_sp_text, etc. and
 make the default search field out of
  <copyField source="*_text" dest="defaultText"/>
 Hmm.
 
 Lance
 
 -Original Message-
 From: Thom Nelson [mailto:[EMAIL PROTECTED]
 Sent: Friday, September 28, 2007 12:07 PM
 To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
 Subject: Re: Index multiple languages with multiple analyzers with the same
 field
 
 I had the same problem, but never found a good solution.  The best solution
 would be a more dynamic way of determining which analyzer to use, such
 as having some kind of conditional expression evaluation in the
 fieldType/analyzer element, where either the document or the query request
 could be used as the comparison object.
 
 <fieldtype type="textMultiLingual" class="solr.TextField">
   <analyzer type="query" expression="request.lang == 'EN'">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StandardFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldtype>
 
 Analyzers could still be cached by adding the expression to the cache key.
 
 Unfortunately I have switched jobs, so I don't have the time or motivation
 to do this, but it should be a very useful addition.
 
 - Thom
 
 Wu, Daniel wrote:
 Hi,
  
 I know this probably has been asked before, but I was not able to find
 it in the mailing list.  So forgive me if I repeated the same question.
  
 We are trying to build a search application to support multiple
 languages.  Users can potentially query in any language.  The first
 thought that came to us is to index the text of all languages in the same
 field using language-specific analyzers.  As all the data is indexed
 in the same field, it would just find results in the language that
 matches the user query.
  
 Looking at the Solr schema, it seems each field can have one and only one
 analyzer.  Is it possible to have multiple analyzers for the same field?
  
 Or is there any other approaches that can achieve the same thing?
  
 Daniel
 
   
 





Re: Problem with html code inside xml

2007-10-02 Thread [EMAIL PROTECTED]

Thanks

I use this solution:

put <![CDATA[  here my html code  ]]> in the xml to be indexed and
it works; nothing to change in the xsl.
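
That is, the posted field looks something like:

  <field name="storyFullText"><![CDATA[ <div class="storyTitle">Les débats</div> ]]></field>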


In the schema I use this fieldType

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

--
Now a question:
I created a field to index only the text of this html code.

I created a field type:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Everything works (the div tags and p tags are removed), but some
<strong>nnn</strong> or <br/> tags are still in the text after
indexing.


If you've got any idea how to solve this problem, that would be great.

Thanks

S. Christin



-


On 25 Sept 2007, at 13:14, Thorsten Scherler wrote:


On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:

If I understand, you want to keep the raw html code in solr like that
(in your posting xml file):

<field name="storyFullText">
  <html></html>
</field>

I think you should encode your content to protect these xml entities:
  <  -  &lt;
  >  -  &gt;
  "  -  &quot;
  &  -  &amp;

If you use perl, have a look at HTML::Entities.


AFAIR you cannot use tags; they always get transformed to
entities. The solution is to apply an xsl transformation to the
response that transforms the entities back into tags.

Have a look at the thread
http://marc.info/?t=11677583791&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2

HTH

salu2




On 9/25/07, [EMAIL PROTECTED] [EMAIL PROTECTED]  
wrote:

Hello,

I've got a problem with html code that is embedded in an xml file:

Sample source:

<content>
<stories>
<div class="storyTitle">
    Les débats
</div>
<div class="storyIntroductionText">
    Le premier tour des élections fédérales se déroulera le 21
octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
vous, dont plusieurs grands débats à l'enseigne de Forums.
</div>
<div class="paragraph">
<div class="paragraphTitle"/>
<div class="paragraphText">
    my para texte here
<br/>
<br/>
    Vous trouverez sur cette page toutes les dates et les heures de
ces différents rendez-vous ainsi que le nom et les partis des
débatteurs. De plus, vous pourrez également écouter ou réécouter
l'ensemble de ces émissions.
</div>
</div>

-
When I make a query on solr, I get something like this in the
source code of the xml result:

<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">paragraph</span>
<span class="markup">&gt;</span><div class="expander-content">
<div class="indent"><span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">paragraphTitle</span>
<span class="markup">/&gt;</span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup">&lt;</span>
...

It is not exactly what I want. I want to keep the html tags, that's all,
without formatting.

So the <br> tags and <a> tags are well formed in the xml and json results,
but the div tags are not kept.
-
In the schema.xml I've got this for the html content

<fieldType name="html" class="solr.TextField" />

<field name="storyFullText" type="html" indexed="true"
       stored="true" multiValued="true"/>

-

Any help would be appreciated.

Thanks in advance.

S. Christin










--
Thorsten Scherler
thorsten.at.apache.org
Open Source Java consulting, training and solutions






Re: Letter-number transitions - can this be turned off

2007-10-02 Thread F Knudson

Thanks for your helpful suggestions.

I have considered other analyzers but WDF has great strengths.  I will
experiment with maintaining transitions and then consider modifying the
code.

F. Knudson


Mike Klaas wrote:
 
 On 30-Sep-07, at 12:47 PM, F Knudson wrote:
 

 Is there a flag to disable the letter-number transition in
 solr.WordDelimiterFilterFactory?  We are indexing category codes and
 thesaurus codes, for which this letter-number transition makes no sense.
 It is bloating the index (which is already large).
 
 Have you considered using a different analyzer?
 
 If you want to continue using WDF, you could make a quick change
 around line 320:
 
  if (splitOnCaseChange == 0 &&
      (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
    // ALPHA->ALPHA: always ignore if case isn't considered.

  } else if ((lastType & UPPER) != 0 && (type & LOWER) != 0) {
    // UPPER->LOWER: Don't split
  } else {

    ...
 
 by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and
 ignores it.
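 
 A rough, untested sketch of such a clause (assuming the numeric type
 flag in that version of the source is called DIGIT, alongside the ALPHA
 constant above):
 
   } else if (((lastType & ALPHA) != 0 && (type & DIGIT) != 0)
           || ((lastType & DIGIT) != 0 && (type & ALPHA) != 0)) {
     // letter<->number transition: ignore it, mirroring the
     // ALPHA->ALPHA case above, so no split happens here
   }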
 
 Another approach that I am using locally is to maintain the  
 transitions, but force tokens to be a minimum size (so r2d2 doesn't  
 tokenize to four tokens but arrrdeee does).
 
 There is a patch here: http://issues.apache.org/jira/browse/SOLR-293
 
 If you vote for it, I promise to get it in for 1.3 <g>
 
 -Mike
 
 

-- 
View this message in context: 
http://www.nabble.com/Letter-number-transitions---can-this-be-turned-off-tf4544769.html#a13003019
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Searching combined English-Japanese index

2007-10-02 Thread Lance Norskog
Python does not do Unicode strings natively; you have to handle them
explicitly. It is possible that your python receiver is not doing the right
thing with the incoming strings.  Also, Jetty has problems with UTF-8; the
Wiki has more on this.

Lance 

-Original Message-
From: Maximilian Hütter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 02, 2007 1:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching combined English-Japanese index

Yonik Seeley wrote:
 On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 Yonik Seeley wrote:
 On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 When I search using an English term, I get results but the Japanese 
 is not encoded correctly in the response. (although it is UTF-8 
 encoded)
 One quick thing to try is the python writer (wt=python) to see the 
 actual unicode values of what you are getting back (since the python 
 writer automatically escapes non-ascii).  That can help rule out 
 incorrect charset handling by clients.

 -Yonik

 Thanks for the tip, it turns out that the unicode values are wrong...
 I mean the browser displays correctly what is sent. But I don't know
 how solr gets these values.
 
 OK, so they never got into the index correctly.
 The most likely explanation is that the charset wasn't set correctly 
 when the update message was sent to Solr.
 
 -Yonik
 
Are you sure they are wrong in the index? When I use the Lucene Index
Monitor (http://limo.sourceforge.net/) to look at the document in the index,
the Japanese is displayed correctly.
I am using Jetty 6.0.1 by the way.

Best regards,

Max

--
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart

Tel:  (+49) 0711 - 45 10 17 578
Fax:  (+49) 0711 - 45 10 17 573
e-mail :  [EMAIL PROTECTED]
Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich



Re: Searching combined English-Japanese index

2007-10-02 Thread Yonik Seeley
On 10/2/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 Are you sure, they are wrong in the index?

It's not an issue with Jetty output encoding since the python writer
takes the string and converts it to ascii before that.  Since Solr
does no charset encoding itself on output, that must mean that it's in
the index incorrectly.

 When I use the Lucene Index
 Monitor (http://limo.sourceforge.net/) to look at the document in the
 index the Japanese is displayed correctly.

I've never really used limo, but it's possible it's incorrectly
interpreting what's in the index (and by luck doing the reverse
transformation that got the data in there incorrectly).

Try indexing a document with a unicode character specified via an
entity, to remove the issues of input char encodings.  For example, if
a Japanese char has a unicode value of \u1234, then in the XML doc,
use &#x1234;
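
For instance, a minimal test update along these lines ("id" and "name"
are fields from the example schema; &#x65E5; stands for U+65E5):

  <add>
    <doc>
      <field name="id">utf8-test</field>
      <field name="name">&#x65E5;</field>
    </doc>
  </add>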

-Yonik


schema for response

2007-10-02 Thread Yu-Hui Jin
Hi, there,

There are some questions about the updated XML response schema in Solr 1.2.
Can someone point me to the XML schema? Is it documented somewhere?

I'm particularly interested in the different status codes we could get in
the response for either update or select.


-- 
Regards,

-Hui


Re: schema for response

2007-10-02 Thread Ryan McKinley

Yu-Hui Jin wrote:

Hi, there,

There are some questions about the updated XML response schema in Solr 1.2.
Can someone point me to the XML schema? Is it documented somewhere?

I'm particularly interested in the different status codes we could get in
the response for either update or select.



In 1.2, /update and /select can share the same response format if you 
set <requestDispatcher handleSelect="true"> in solrconfig.xml.


All status codes in 1.2 should map to standard HTTP status codes -- 200 
is OK, 400 bad request, 500 some server error, etc...
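
For reference, the in-body status appears in the standard response header,
which for a successful request looks like:

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
    </lst>
    ...
  </response>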


ryan




Re: dataset parameters suitable for lucene application

2007-10-02 Thread Chris Harris
Hi There,

Would you mind if I pasted your data onto the wiki page at

http://wiki.apache.org/solr/SolrPerformanceData

I think it would be helpful to get some more numbers on that page, so
people can decide whether Solr is the right application for them.

Thanks,
Chris Harris, new Solr user

On 9/26/07, Xuesong Luo [EMAIL PROTECTED] wrote:
 My experience so far:
 200K documents were indexed in 90 mins (including db time); the index
 size is 200MB. Querying a keyword across all 30 string fields takes
 0.3-1 sec; querying a keyword on one field takes tens of milliseconds.



 -Original Message-
 From: Charlie Jackson [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 26, 2007 8:53 AM
 To: solr-user@lucene.apache.org
 Subject: RE: dataset parameters suitable for lucene application

 My experiences so far with this level of data have been good.

 Number of records: Maxed out at 8.8 million
 Database size: friggin huge (100+ GB)
 Index size: ~24 GB

 1) It took me about a day to index 8 million docs using a non-optimized
 program I wrote. It's non-optimized in the sense that it's not
 multi-threaded. It batched together groups of about 5,000 docs at a time
 to be indexed.

 2) Search times for a basic search are almost always sub-second. If we
 toss in some faceting, it takes a little longer, but I've hardly ever
 seen it go above 1-2 seconds even with the most advanced queries.

 Hope that helps.


 Charlie

 

 -Original Message-
 From: Law, John [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 26, 2007 9:28 AM
 To: solr-user@lucene.apache.org
 Subject: dataset parameters suitable for lucene application

 I am new to the list and new to lucene and solr. I am considering Lucene
 for a potential new application and need to know how well it scales.

 Following are the parameters of the dataset.

 Number of records: 7+ million
 Database size: 13.3 GB
 Index Size:  10.9 GB

 My questions are simply:

 1) Approximately how long would it take Lucene to index these documents?
 2) What would the approximate retrieval time be (i.e. search response
 time)?

 Can someone provide me with some informed guidance in this regard?

 Thanks in advance,
 John

 __
 John Law
 Director, Platform Management
 ProQuest
 789 Eisenhower Parkway
 Ann Arbor, MI 48106
 734-997-4877
 [EMAIL PROTECTED]
 www.proquest.com
 www.csa.com

 ProQuest... Start here.





Solr live at Netflix

2007-10-02 Thread Walter Underwood
Here at Netflix, we switched over our site search to Solr two weeks ago.
We've seen zero problems with the server. We average 1.2 million
queries/day on a 250K item index. We're running four Solr servers
with simple round-robin HTTP load-sharing.

This is all on 1.1. I've been too busy tuning to upgrade.

Thanks everyone, this is a great piece of software.

wunder
--
Walter Underwood
Search Guy, Netflix



Re: Re: Problem with html code inside xml

2007-10-02 Thread ycrux
Hi !

I'm facing a similar problem. Some HTML docs are correctly indexed and others 
are simply rejected, even though I encoded all problematic HTML tags as Thorsten 
suggested.

In the following example, my_doc.xml is a valid XML file, compliant with my
Solr schema fields:

$ java -jar post.jar ./my_doc.xml 

SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, 
other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at 
http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP 
response code: 500 for URL: http://localhost:8983/solr/update

Is there any way to get Solr to be more verbose than that?
Do I need to go into the Java code to understand what happens?
I'm looking for a simple solution.

Thanks in advance

cheers
Y.

-----Original Message-----
From: [EMAIL PROTECTED] 
Subject: Re: Problem with html code inside xml
Date: Tue, 2 Oct 2007 16:15:26 +0200
To: solr-user@lucene.apache.org

Thanks

I use this solution:

put <![CDATA[  here my html code  ]]> in the xml to be indexed and
it works; nothing to change in the xsl.

In the schema I use this fieldType

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

--
Now a question:
I created a field to index only the text of this html code.

I created a field type:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Everything works (the div tags and p tags are removed), but some
<strong>nnn</strong> or <br/> tags are still in the text after
indexing.

If you've got any idea how to solve this problem, that would be great.

Thanks

S. Christin



-


On 25 Sept 2007, at 13:14, Thorsten Scherler wrote:

 On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
 If I understand, you want to keep the raw html code in solr like that
 (in your posting xml file):

 <field name="storyFullText">
   <html></html>
 </field>

 I think you should encode your content to protect these xml entities:
   <  -  &lt;
   >  -  &gt;
   "  -  &quot;
   &  -  &amp;

 If you use perl, have a look at HTML::Entities.

 AFAIR you cannot use tags; they always get transformed to
 entities. The solution is to apply an xsl transformation to the
 response that transforms the entities back into tags.

 Have a look at the thread
 http://marc.info/?t=11677583791&r=1&w=2
 and especially at
 http://marc.info/?l=solr-user&m=116782664828926&w=2

 HTH

 salu2



 On 9/25/07, [EMAIL PROTECTED] [EMAIL PROTECTED]
 wrote:
 Hello,

 I've got a problem with html code that is embedded in an xml file:

 Sample source:

 <content>
 <stories>
 <div class="storyTitle">
     Les débats
 </div>
 <div class="storyIntroductionText">
     Le premier tour des élections fédérales se déroulera le 21
 octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
 vous, dont plusieurs grands débats à l'enseigne de Forums.
 </div>
 <div class="paragraph">
 <div class="paragraphTitle"/>
 <div class="paragraphText">
     my para texte here
 <br/>
 <br/>
     Vous trouverez sur cette page toutes les dates et les heures de
 ces différents rendez-vous ainsi que le nom et les partis des
 débatteurs. De plus, vous pourrez également écouter ou réécouter
 l'ensemble de ces émissions.
 </div>
 </div>

 -
 When I make a query on solr, I get something like this in the
 source code of the xml result:

 <td xmlns="http://www.w3.org/1999/xhtml">
 <span 

Re: Problem with html code inside xml

2007-10-02 Thread Chris Hostetter
: I created a field type:
: 
: fieldType name=htmlTxt class=solr.TextField positionIncrementGap=100

...

: Everything works (the div tags, p tags are removed) but some
: strongnnn/strong   or br/ tags are style in the text after indexing.

i cut/paste that fieldtype into the example schema.xml, and experimented 
with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp) and 
both of those examples were correctly stripped.

do you have a more specific example of something that doesn't work?

Hmm... it seems like maybe the problem is examples like this...
blahblah<strong>nnn</strong>
...if the tag is directly adjacent to other text, it may not get stripped 
off ... i'm not sure if that's specific to the HtmlWhitespaceTokenizer.




-Hoss


Re: Re: Problem with html code inside xml

2007-10-02 Thread Chris Hostetter

: SimplePostTool: FATAL: Connection error (is Solr running at 
http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP 
response code: 500 for URL: http://localhost:8983/solr/update
: 
: Is there any way to let Solr to be more verbose than that ?

Solr outputs all errors using whatever default error page format your 
servlet container uses; it also logs all errors to the servlet container's 
logging system.

this specific error indicates that post.jar could not connect to Solr at 
all (hence the "FATAL: Connection error" and the hint that perhaps Solr 
isn't actually running at the URL post.jar is trying to contact).

If you are using the example Jetty setup that comes with Solr, and you 
send a document that triggers a Solr error, post.jar will output something 
like this (in this specific case, the problem is that the document 
being posted is total gibberish, and not XML at all)...

SimplePostTool: FATAL: Solr returned an error: 
ParseError_at_rowcol11_Message_only_whitespace_content_allowed_before_start_tag_and_not___javaxxmlstreamXMLStreamException_ParseError_at_rowcol11_Message_only_whitespace_content_allowed_before_start_tag_and_not___at_combeaxmlstreamMXParserparsePrologMXParserjava2044__at_combeaxmlstreamMXParsernextImplMXParserjava1947__at_combeaxmlstreamMXParsernextMXParserjava1333__at_orgapachesolrhandlerXmlUpdateRequestHandlerprocessUpdateXmlUpdateRequestHandlerjava148__at_orgapachesolrhandlerXmlUpdateRequestHandlerhandleRequestBodyXmlUpdateRequestHandlerjava123__at_orgapachesolrhandlerRequestHandlerBasehandleRequestRequestHandlerBasejava78__at_orgapachesolrcoreSolrCoreexecuteSolrCorejava807__at_orgapachesolrservletSolrDispatchFilterexecuteSolrDispatchFilterjava206__at_orgapachesolrservletSolrDispatchFilterdoFilterSolrDispatchFilterjava174__at_orgmortbayjettyservletServletHandler$CachedChaindoFilterServletHandlerjava1089__at_orgmortbayjettyservletServletHandlerhandleServletHandlerjava365__at_orgmortbayjettysecuritySecurityHandlerhandleSecurityHandlerjava216__at_orgmortbayjettyservletSessionHandlerhandleSessionHandlerjava181__at_orgmortbayjettyhandlerContextHandlerhandleContextHandlerjava712__at_orgmortbayjettywebappWebAppContexthandleWebAppContextjava405__at_orgmortbayjettyhandlerContextHandlerCollectionhandleContextHandlerCollectionjava211__at_orgmortbayjettyhandlerHandlerCollectionhandleHandlerCollectionjava114__at_orgmortbayjettyhandlerHandlerWrapperhandleHandlerWrapperjava139__at_orgmortbayjettyServerhandleServerjava285__at_orgmortbayjettyHttpConnectionhandleRequestHttpConnectionjava502__at_orgmortbayjettyHttpConnection$RequestHandlercontentHttpConnectionjava835__at_orgmortbayjettyHttpParserparseNextHttpParserjava641__at_orgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_orgmortbayjettyHttpCo



-Hoss


Re: Solr live at Netflix

2007-10-02 Thread Chris Hostetter

: Here at Netflix, we switched over our site search to Solr two weeks ago.

That's great Walter ... could I persuade you to add a few notes about this 
to...

http://wiki.apache.org/solr/PublicServers
http://wiki.apache.org/solr/SolrPerformanceData


-Hoss



Re: Solr live at Netflix

2007-10-02 Thread Walter Underwood
I think Chris Harris is doing that. I'll check it and touch it up
afterwards. Avoid race conditions. --wunder


On 10/2/07 4:26 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

 
 : Here at Netflix, we switched over our site search to Solr two weeks ago.
 
 That's great Walter ... could I persuade you to add a few notes about this
 to...
 
 http://wiki.apache.org/solr/PublicServers
 http://wiki.apache.org/solr/SolrPerformanceData
 
 
 -Hoss
 



Re: Solr live at Netflix

2007-10-02 Thread Tom Hill
Nice!

And there seem to be some improvements. For example, Gamers and Gamera
no longer stem to the same word :-)

Tom

On 10/2/07, Walter Underwood [EMAIL PROTECTED] wrote:

 Here at Netflix, we switched over our site search to Solr two weeks ago.
 We've seen zero problems with the server. We average 1.2 million
 queries/day on a 250K item index. We're running four Solr servers
 with simple round-robin HTTP load-sharing.

 This is all on 1.1. I've been too busy tuning to upgrade.

 Thanks everyone, this is a great piece of software.

 wunder
 --
 Walter Underwood
 Search Guy, Netflix




question about bi-gram analysis on query

2007-10-02 Thread Keene, David
Hey guys,

I'm trying to index a field in Chinese using the CJKTokenizer, and I'm finding 
that my searches on the index are not working at all.  The index is created 
properly (looking with Luke), and when I search against it with Luke the data 
comes back as I would expect.  Also, when I use the analysis page of solr 
admin, the result is what I would expect.  On an actual search though, nothing 
is found.

Here are the relevant snippets from my confs:

<fieldtype name="text_zh" class="solr.TextField">
  <analyzer>
    <tokenizer
      class="org.apache.solr.analysis.ja.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldtype>

...

<field name="text" type="text_zh" indexed="true" stored="false"
       multiValued="true"/>


So if I send in
美聯社
it correctly creates 2 tokens:
美聯  聯社

And if I do a search in Luke or on the solr analysis page for 美聯, I get a hit.
But on the actual search, I don't.

Also, I've noticed that the parsed query in Luke is:
text:"美聯 聯社"
and in solr it is:
text:"美聯 聯社 "
I noticed there is an extra space in the solr parsed query.  I don't know if 
that makes a difference.

I'm really at a loss.  Does anyone know why I don’t get search hits back?

Thanks,
Dave Keene
  


Re: Solr live at Netflix

2007-10-02 Thread Norberto Meijome
On Tue, 02 Oct 2007 15:26:33 -0700
Walter Underwood [EMAIL PROTECTED] wrote:

 Here at Netflix, we switched over our site search to Solr two weeks ago.
 We've seen zero problems with the server. We average 1.2 million
 queries/day on a 250K item index. We're running four Solr servers
 with simple round-robin HTTP load-sharing.

Hi Walter,
would you mind sharing hardware specs, OS, index size, VM settings, and
OS-specific tunings?

unless that will be added to the wiki... :)

thanks in advance,
B

_
{Beto|Norberto|Numard} Meijome

Have the courage to take your own thoughts
seriously, for they will shape you.
   Albert Einstein

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: schema for response

2007-10-02 Thread Yu-Hui Jin
Got it. Thanks, Ryan.


-Hui

On 10/2/07, Ryan McKinley [EMAIL PROTECTED] wrote:

 Yu-Hui Jin wrote:
  Hi, there,
 
  There are some questions about the updated XML response schema in
  Solr 1.2.  Can someone point me to the XML schema? Is it documented
  somewhere?

  I'm particularly interested in the different status codes we could get
  in the response for either update or select.
 

 In 1.2, /update and /select can share the same response format if you
 set <requestDispatcher handleSelect="true"> in solrconfig.xml.

 All status codes in 1.2 should map to standard HTTP status codes -- 200
 is OK, 400 bad request, 500 some server error, etc...

 ryan





-- 
Regards,

-Hui


Re: searching remote indexes

2007-10-02 Thread Venkatraman S
Well, we do not have a Solr server; all the calls to index and search
documents are done via embedded Solr.
What is the approach then?

On 9/28/07, Mike Klaas [EMAIL PROTECTED] wrote:

 Solr's main interface is http, so you can connect to that remotely.
 Query each machine and combine the results using your own business logic.

 Alternatively, you can try out the query distribution code being
 developed in
 http://issues.apache.org/jira/browse/SOLR-303

 -Mike

 On 28-Sep-07, at 1:59 AM, Venkatraman S wrote:

  resending due to lack of response :
  [We are using embedded solr 1.2 ]
 
  I need a mechanism by which I can search over 3 remote indexes. Can
  I use
  the Lucene remote APIs to access the index created via embedded Solr?
 
  -Venkat
 
  On 9/4/07, Venkatraman S [EMAIL PROTECTED] wrote:
 
  Hi,
 
  [I am new to Solr].
 
  How do i search remote indexes using Solr? I am not able to find
  suitable
  documentation on this - can you guys guide me?
 
  Regards,
  Venkat
 
  --
 
 
 
 
  --




--


Re: searching remote indexes

2007-10-02 Thread Ryan McKinley
Using embedded solr, there is no (built-in) way to access remote 
indexes.  If you want to access remote indexes you need to run a server.


Solr 1.3 (trunk) includes a java client you may want to look at:
http://wiki.apache.org/solr/Solrj

If you poke around, this also includes simple ways to run solr with 
embedded jetty - letting you run a lightweight server.
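
For what it's worth, querying a remote Solr over http with the trunk Solrj
client looks roughly like this (a sketch against the trunk API at the time;
double-check class names on the wiki page above):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // point one client at each remote index, then merge results yourself
  CommonsHttpSolrServer server =
      new CommonsHttpSolrServer("http://remotehost:8983/solr");
  QueryResponse rsp = server.query(new SolrQuery("test"));
  System.out.println(rsp.getResults().getNumFound());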


ryan


Venkatraman S wrote:

Well, we do not have a Solr server; all the calls to index and search
documents are done via embedded Solr.
What is the approach then?

On 9/28/07, Mike Klaas [EMAIL PROTECTED] wrote:

Solr's main interface is http, so you can connect to that remotely.
Query each machine and combine the results using your own business logic.

Alternatively, you can try out the query distribution code being
developed in
http://issues.apache.org/jira/browse/SOLR-303

-Mike

On 28-Sep-07, at 1:59 AM, Venkatraman S wrote:


resending due to lack of response :
[We are using embedded solr 1.2 ]

I need a mechanism by which I can search over 3 remote indexes. Can
I use
the Lucene remote APIs to access the index created via embedded Solr?

-Venkat

On 9/4/07, Venkatraman S [EMAIL PROTECTED] wrote:

Hi,

[I am new to Solr].

How do i search remote indexes using Solr? I am not able to find
suitable
documentation on this - can you guys guide me?

Regards,
Venkat

--




--





--