Boosting in version 1.2

2007-06-08 Thread Thierry Collogne

Hello,

Our documents contain three fields: title, keywords, and content.
What we want is to give priority to the field keywords, then title, and last
content.

So in our xml file that is to be indexed, we put the following:

<doc>
 <field name="keywords" boost="3.0">letters</field>
 <field name="title" boost="2.0">This is a test</field>
 <field name="content"><![CDATA[This is a test]]></field>
</doc>
<doc>
 <field name="keywords" boost="3.0">foobar</field>
 <field name="title" boost="2.0">This is a test letters</field>
 <field name="content"><![CDATA[This is a test]]></field>
</doc>
<doc>
 <field name="keywords" boost="3.0">foobar</field>
 <field name="title" boost="2.0">This is a test</field>
 <field name="content"><![CDATA[This is a test letters]]></field>
</doc>

In our schema.xml we have put

<defaultSearchField>text</defaultSearchField>

<copyField source="titlesearch" dest="text"/>
<copyField source="keywords" dest="text"/>
<copyField source="content" dest="text"/>

Now when we do a search like this

http://localhost:8666/solr/select/?q=letters&version=2.2&start=0&rows=10&indent=on

We don't always get the document with letters in keywords on top. To get
this to work, we need to specify the 3 search fields like this

http://localhost:8666/solr/select/?q=content%3Aletters+OR+titlesearch%3Aletters+OR+keywords%3Aletters&version=2.2&start=0&rows=10&indent=on

I was wondering if there is a way in Solr 1.2 to specify more than one
default search field, or is the above solution still the way to go?

Thank you,

Thierry


How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-08 Thread Thierry Collogne

Hello,

I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
with no luck.

I have a field content that contains the following:

<field name="content"><![CDATA[test <a href="test">link</a> post]]></field>

When I do a search I get the following

<result name="response" numFound="1" start="0">
 <doc>
  <str name="content">test &lt;a href="test"&gt;link&lt;/a&gt; post</str>
  <str name="id">po_1_NL</str>
  <str name="keywords">post</str>
  <str name="titlesearch">This is a test</str>
 </doc>
</result>


Is this normal? Shouldn't the HTML code and the whitespace be removed from
the field?

This is my config in schema.xml

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
   <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
 </analyzer>
</fieldType>

<field name="content" type="text_ws" indexed="true" stored="true"
omitNorms="false"/>

Can someone help me with this?


How can I use dates to boost my results?

2007-06-08 Thread Daniel Alheiros
Hi

For my search use, the document freshness is a relevant aspect that should
be considered to boost results.

I have a field in my index like this:

<field name="created" type="date" indexed="true" stored="true" />

How can I make a good use of this to boost my results?

I'm using the DisMaxRequestHandler to boost other textual fields based on
the query, but it would improve the results quality a lot if the date were
considered when computing the score.
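One common way to do this with the dismax handler (a sketch, not a definitive recipe -- the exact function-query support depends on your Solr version, and the numeric constants are just a starting point to tune) is to add a boost function over the created field:

```
bf=recip(rord(created),1,1000,1000)
```

Here recip(x,m,a,b) computes a/(m*x+b), and rord(created) is the document's reverse ordinal by date, so the newest documents get a boost near 1 that decays smoothly for older ones.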


Best Regards,
Daniel


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.



Re: Multi-language indexing and searching

2007-06-08 Thread Henrib

Hi Daniel,
If it is functionally 'ok' to search in only one lang at a time, you could
try having one index per lang. Each per-lang index would have one schema
where you would describe field types (the lang part coming through
stemming/snowball analyzers, per-lang stopwords et al.) and the same field
name could be used in each of them.
You could either deploy that solution through multiple web-apps (one per
lang) or try the patch for issue SOLR-215.
Regards,
Henri


Daniel Alheiros wrote:
 
 Hi, 
 
 I'm just starting to use Solr and so far, it has been a very interesting
 learning process. I wasn't a Lucene user, so I'm learning a lot about
 both.
 
 My problem is:
 I have to index and search content in several languages.
 
 My scenario is a bit different from other that I've already read in this
 forum, as my client is the same to search any language and it could be
 accomplished using a field to define language.
 
 My questions are more focused on how to keep the benefits of all the
 protwords, stopwords and synonyms in a multilanguage situation
 
 Should I create new Analyzers that can deal with the language field of
 the
 document? What do you recommend?
 
 Regards,
 Daniel 
 
 

-- 
View this message in context: 
http://www.nabble.com/Multi-language-indexing-and-searching-tf3885324.html#a11027333
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multi-language indexing and searching

2007-06-08 Thread Daniel Alheiros
Hi Henri.

Thanks for your reply.
I've just looked at the patch you referred to, but doing this I will lose the
out-of-the-box Solr installation... I'll have to create my own Solr
application responsible for creating the multiple cores, and I'll have to
change my indexing process to something able to route content to a
specific core.

Can't I have the same index, using one single core, same field names being
processed by language specific components based on a field/parameter?

I will try to draw what I'm thinking, please forgive me if I'm not using the
correct terms but I'm not an IR expert.

Thinking in a workflow:

Indexing:
    Multilanguage indexer receives some documents
    for each document, verify the language field
        if language = English then process using the EnglishIndexer
        else if language = Chinese then process using the ChineseIndexer
        else if ...

Querying:
    Multilanguage Request Handler receives a request
    if parameter language = English then process using the English Request Handler
    else if parameter language = Chinese then process using the Chinese Request Handler
    else if ...

I can see that in the schema field definitions, we have some language
dependent parameters... It can be a problem, as I would like to have the
same fields for all requests...

Sorry to bother, but before I split all my data this way I would like to be
sure that it's the best approach for me.

Regards,
Daniel


On 8/6/07 15:15, Henrib [EMAIL PROTECTED] wrote:

 
 [...]





problem with schema.xml

2007-06-08 Thread mirko
Hi,

I just started playing around with Solr 1.2.  It has some nice improvements.
I noticed that errors in the schema.xml get reported in a verbose way now, but
the following steps cause a problem for me:

1. start with a correct schema.xml -> Solr works fine
2. edit it in a way that is no longer correct (say, remove the </schema> closing
tag) -> Solr works fine
3. restart the webapp (through the Tomcat manager interface) -> Solr complains
that the schema.xml does not parse, fine.
4. now restart again (without fixing the schema.xml!) -> Solr won't even start up
5. fix the above problem (add the closing tag) and restart via Tomcat's manager
-> the webapp cannot restart, showing that there is a problem:
FAIL - Application at context path /furness could not be started

The steps above might seem artificial, but assume you don't manage to fix
all the typos in your schema.xml on the first attempt.  It seems after restart
Solr gets stuck in some state and I cannot get it up and running via Tomcat's
manager, only by restarting Tomcat.

Am I missing something?
Thanks,
mirko


Re: How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-08 Thread Yonik Seeley

On 6/8/07, Thierry Collogne [EMAIL PROTECTED] wrote:

I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
with no luck.

[...]

Is this normal? Shouldn't the html code and the white spaces be removed from
the field?


For indexing purposes, yes.  The stored field you get back will be
unchanged though.
If you want to see what will be indexed, try the analysis debugger in
the admin pages.

-Yonik


Cannot index the '&' character using post.jar

2007-06-08 Thread Tiong Jeffrey

Hi all,

I tried to index a document that contains '&' using post.jar. But during the
indexing it causes an error and it won't finish the indexing. Can I know why
this is and how to prevent it? Thanks!
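If the offending character is a bare ampersand, the posted file is not well-formed XML, which would make the update fail: XML requires '&' to be escaped as &amp; in element content. A minimal sketch of a well-formed add document (the field names here are just for illustration):

```xml
<add>
  <doc>
    <field name="id">example_1</field>
    <field name="content">fish &amp; chips</field>
  </doc>
</add>
```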

Jeffrey


Re: Boosting in version 1.2

2007-06-08 Thread Mike Klaas

On 8-Jun-07, at 2:07 AM, Thierry Collogne wrote:


Hello,

Our documents contain three fields: title, keywords, content.
What we want is to give priority to the field keywords, then title and
last content.




In our schema.xml we have put

<defaultSearchField>text</defaultSearchField>

<copyField source="titlesearch" dest="text"/>
<copyField source="keywords" dest="text"/>
<copyField source="content" dest="text"/>

Now when we do a search like this

http://localhost:8666/solr/select/?q=letters&version=2.2&start=0&rows=10&indent=on


We don't always get the document with letters in keywords on top. To get
this to work, we need to specify the 3 search fields like this


I'm surprised that that finds anything--you've specified a  
defaultSearchField that doesn't exist in the documents you posted.


http://localhost:8666/solr/select/?q=content%3Aletters+OR+titlesearch%3Aletters+OR+keywords%3Aletters&version=2.2&start=0&rows=10&indent=on


I was wondering if there is a way in Solr 1.2 to specify more than one
default search field, or is the above solution still the way to go?


This is precisely the situation that the dismax handler was designed  
for.  Plus, you don't have to fiddle around with document boosts.


try:

 qt=dismax q=letters qf=keywords^3.0 title^2.0 content
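Spelled out as a request URL (reusing the host and port from the examples above, with the caret and spaces in qf URL-encoded), that might look like:

```
http://localhost:8666/solr/select/?qt=dismax&q=letters&qf=keywords%5E3.0+title%5E2.0+content
```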

-Mike


Re: problem with schema.xml

2007-06-08 Thread Ryan McKinley
I don't use tomcat, so I can't be particularly useful.  The behavior you 
describe does not happen with resin or jetty...


My guess is that tomcat is caching the error state.  Since fixing the 
problem is outside the webapp directory, it does not think it has 
changed so it stays in a broken state.


if you touch the .war file, does it restart ok?

but i'm just guessing...


[EMAIL PROTECTED] wrote:

[...]





Re: problem with schema.xml

2007-06-08 Thread mirko
Hi Ryan,

I have my .war file located outside the webapps folder (I am using multiple
Solr instances with a config as suggested on the wiki:
http://wiki.apache.org/solr/SolrTomcat).

Nevertheless, I touched the .war file, the config file, the directory under
webapps, but nothing seems to be working.

Any other suggestions?  Is someone else experiencing the same problem?
thanks,
mirko


Quoting Ryan McKinley [EMAIL PROTECTED]:

 [...]




Re: To make sure XML is UTF-8

2007-06-08 Thread Funtick


Tiong Jeffrey wrote:
 
 Thought this is not directly related to Solr, but I have a XML output from
 mysql database, but during indexing the XML output is not working. And the
 problem is part of the XML output is not in UTF-8 encoding, how can I
 convert it to UTF-8 and how do I know what kind of coding it uses in the
 first place (the data I export from the mysql database). Thanks!
 

You won't have any problem with standard JAXP and java.util.* etc. classes,
even with complex MySQL data (one column is LATIN1, another is LATIN2,
another is ASCII, ...)

In Java, use standard classes: String, Long, Date. And use JAXP.
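As a minimal sketch of that approach (the Latin-1 source charset below is an assumption for illustration -- substitute whatever charset your MySQL column actually uses), re-encoding in Java is just a decode followed by an encode:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Fix {
    // Decode the raw bytes using the charset they were actually written
    // in, then re-encode the resulting String as UTF-8.
    public static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String decoded = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };     // "é" in Latin-1 is one byte
        byte[] utf8 = latin1ToUtf8(latin1);  // "é" in UTF-8 is 0xC3 0xA9
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

The two steps are the same whatever the source charset is; the hard part is knowing which charset each column really uses, which is metadata MySQL can tell you but the raw bytes themselves cannot.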



Re: To make sure XML is UTF-8

2007-06-08 Thread funtick

[...]


How do you generate XML output? Output itself is usually a raw byte
array; it uses a transport and an encoding. If you save it in a file
system and forget about the transport-layer encoding you will get some
new problems...



during indexing the XML output is not working

- what exactly happens, which kind of error messages?




Re: Multi-language indexing and searching

2007-06-08 Thread Chris Hostetter

: Can't I have the same index, using one single core, same field names being
: processed by language specific components based on a field/parameter?

yes, but you don't really need the complexity you describe below ... you
don't need separate request handlers per language, just separate fields
per language.  assuming you care about 3 concepts: title, author, body ..
in a single language index those might correspond to three fields; in your
index they correspond to 3*N fields where N is the number of languages you
want to support...

   title_french
   title_english
   title_german
   ...
   author_french
   author_english
   ...

documents which are in english only get values for the english fields,
documents in french etc... ... unless perhaps you want to support
translations of the documents, in which case you can have values in the
fields for multiple languages; it's up to you.  When a user wants to query
in french, you take their input and query against the body_french field
and display the title_french field, etc...
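A sketch of what two of those per-language field types might look like in schema.xml (the type and field names are illustrative; the lowercase and snowball filters shown are the stock Solr ones):

```xml
<fieldType name="text_english" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_french" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<field name="title_english" type="text_english" indexed="true" stored="true"/>
<field name="title_french"  type="text_french"  indexed="true" stored="true"/>
```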

-Hoss



Re: Solr 1.2 released

2007-06-08 Thread Jack L
Hello Yonik,

This is great news. Will it be a drop-in replacement for 1.1?
I.e., do I need to make any changes other than replacing the
jar files? I suppose the index files will still be good. Are
1.2 schema files and config files compatible with those of 1.1?

-- 
Best regards,
Jack

Thursday, June 7, 2007, 7:32:18 AM, you wrote:

 Solr 1.2 is now available for download!
 This is the first release since Solr graduated from the Incubator, and
 includes many improvements, including CSV/delimited-text data
 loading, time based auto-commit, faster faceting, negative filters,
 a spell-check handler, sounds-like word filters, regex text filters,
 and more flexible plugins.

 Solr releases can be downloaded from
 http://www.apache.org/dyn/closer.cgi/lucene/solr/

 -Yonik



Re: Solr 1.2 released

2007-06-08 Thread Yonik Seeley

On 6/8/07, Jack L [EMAIL PROTECTED] wrote:

This is great news. Will it be a drop-in replacement for 1.1?
I.e., do I need to make any changes other than replacing the
jar files? I suppose the index files will still be good. Are
1.2 schema files and config files compatible with those of 1.1?


It should be easy to upgrade.
See the release notes (CHANGES.txt)... there is a section on upgrading from 1.1

-Yonik


Re: Wildcards / Binary searches

2007-06-08 Thread Chris Hostetter

: Do you mean something like below ?
: <field name="autocomplete">w wo wor word</field>

yeah, but there are some Tokenizers that make this trivial
(EdgeNGramTokenizer i think is the name)
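If an edge n-gram tokenizer factory is available in your Solr build, the field type might be sketched like this (the factory class name and parameters are assumptions based on the Lucene tokenizer mentioned above, so check what your version actually ships):

```xml
<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <!-- "word" would be indexed as "w", "wo", "wor", "word" -->
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <!-- the user's prefix is matched as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```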


: project, definitively not a good practice for portability of indexes. A
: duplicate field with an analyser to produce a sortable ASCII version
: would be better.

exactly ... I think conceptually the methodology for solving the problem
is very similar to the way the SpellChecker contrib works: use a very
custom index designed for the application (not just look at the terms in
the main corpus) and custom logic for using that index.



-Hoss



RE: Solr 1.2 released

2007-06-08 Thread Teruhiko Kurosaka
I noticed there is no example/ext
directory or the jars that were found there
in 1.1 (commons-el.jar, commons-logging.jar,
jasper-*.jar, mx4j-*.jar)

I have a jar that my Solr plugin depends on.
This jar contains a class that needs to be
loaded only once per container because
it is a JNI library.  For that reason, it
cannot be placed in a per-webapp lib
directory. (I am assuming placing the jars
in example/solr/lib is same as placing them
in each web app's WEB-INF/lib, from the
class loading point of view.  Am I right?)

I tried putting this jar in
the top-level lib and example/solr/lib,
but the jar wasn't recognized.

Where should I put jars shared by multiple
web apps?


BTW, in order to investigate this, I
inspected the start.conf file inside
start.jar and it seems the new start.jar
is expecting to find ant.jar in this 
fixed location:
/usr/share/java/

Is this intended? (I don't know why jetty
needs ant anyway.)

-kuro


Re: solr+hadoop = next solr

2007-06-08 Thread Jeff Rodenburg

On 6/7/07, Rafael Rossini [EMAIL PROTECTED] wrote:


Hi, Jeff and Mike.

   Would you mind telling us about the architecture of your solutions a
little bit? Mike, you said that you implemented a highly-distributed
search
engine using Solr as indexing nodes. What does that mean? You guys
implemented a master, multi-slave solution for replication? Or the whole
index shards for high availability and fail over?



Our solution doesn't use solr, but goes directly to lucene.  It's built on
windows, so the interop communication service is built on .net remoting (tcp
based).  Microsoft has deprecated ongoing development with .net remoting, in
favor of other more standard mechanisms, i.e. http.  So, we're looking to
migrate our solution to a more community-supported model.

The underlying structure sounds similar to what others have done: index
shards distributed to various servers, each responsible for a subset of the
index.  A merging server handles coordination of concurrent thread requests
and synchronizes the results as they're returned.  The thread coordination
and search results interleaving process is functional but not really
scalable.  It works for our user model, where users tend not to page deeply
through results.  We want to change that so we can use solr as our primary
data source read mechanism for our site.

-- j


RE: Solr 1.2 released

2007-06-08 Thread Chris Hostetter
: I noticed there is no example/ext
: directory or jars that was found there
: in 1.1 (commons-el.jar, commons-logging.jar,
: jasper-*.jar, mx4j-*.jar)

the example/ext directory was an entirely Jetty based artifact.  when we
upgraded the Jetty used in the example setup, Jetty no longer had an ext
directory, so it was removed.

: I have a jar that my Solr plugin depends on.
: This jar contains a class that needs to be
: loaded only once per container because
: it is a JNI library.  For that reason, it
: cannot be placed in a per-webapp lib
: directory. (I am assuming placing the jars
: in example/solr/lib is same as placing them
: in each web app's WEB-INF/lib, from the
: class loading point of view.  Am I right?)

not exactly, a custom classloader is constructed for the ${solr.home}/lib
directory, but it is a child loader of the Servlet Context loader, so you
are probably right about it being a poor place to put a JNI library.

: Where should I put jars shared by multiple
: shared apps?

that really depends on your servlet container.  The scaled down Jetty
instance provided is purely an *example* so that people who want to try
solr can do so without needing to download, install, and understand the
configuration of a servlet container.  If you want to use Jetty 6, then
you should read the Jetty docs to learn more about loading classes in the
system classloader.  Alternately if you liked Jetty 5 (which is what was
used in the Solr 1.1 example) you can use it ... but people really
shouldn't count on the servlet container provided to power the example
behaving consistently as new versions of Solr come out -- it might switch to
tomcat in the next version, it all depends on which one is simpler,
smaller, easier to set up for the example, etc...

: is expecting to find ant.jar in this
: fixed location:
: /usr/share/java/
:
: Is this intended? (I don't know why jetty
: needs ant anyway.)

i really can't say ... it's purely a Jetty thing.  We make no
modifications to Jetty's start.jar




-Hoss