IndexMerge not found
I tried http://wiki.apache.org/solr/MergingSolrIndexes system: win2003, jdk 1.6 Error information: Caused by: java.lang.ClassNotFoundException: org.apache.lucene.misc.IndexMergeTool at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClassInternal(Unknown Source) Could not find the main class: org/apache/lucene/misc/IndexMergeTool. Program will exit. -- regards j.L ( I live in Shanghai, China)
Re: IndexMerge not found
I use lucene-core-2.9-dev.jar and lucene-misc-2.9-dev.jar. On Thu, Jul 2, 2009 at 2:02 PM, James liu liuping.ja...@gmail.com wrote: i try http://wiki.apache.org/solr/MergingSolrIndexes [...] -- regards j.L ( I live in Shanghai, China)
Re: Question on Facet Count
On Wed, Jul 1, 2009 at 10:28 PM, Sumit Aggarwal sumit.kaggar...@gmail.com wrote: Hi Shalin, Sorry for the confusion, but I don't have separate index fields. I have all the information in only one index field, descp. Is what you explained still possible? No, you should separate out the data into multiple fields for this to work. One big field containing everything is OK for full-text search, but you can't build faceted search on that. -- Regards, Shalin Shekhar Mangar.
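A minimal illustration of Shalin's point (the field and type names below are my own placeholders, not from the thread): keep one analyzed field for full-text search, and give each facetable attribute its own untokenized field in schema.xml.

```xml
<!-- sketch for schema.xml: one analyzed field for full-text search,
     plus untokenized "string" fields that faceting can count on -->
<field name="descp"    type="text"   indexed="true" stored="true"/>
<field name="brand"    type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
```

Facet counts would then come from facet.field=brand and facet.field=category, while full-text queries still hit descp.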
1.4 stable release date
Hi, Just wondering if there is a release date for 1.4 stable? Regards, Andrew
Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old))
I use Solr to search over an index made directly by Lucene (not via EmbeddedSolrServer; the wiki is old). Is that a problem when I use Solr to search? What is the difference between an index made by Lucene and one made by Solr? Thanks -- regards j.L ( I live in Shanghai, China)
Implementing PhraseQuery and MoreLikeThis Query in one app
Hi, Recently I've posted a question regarding using stop words in a PhraseQuery and in a MoreLikeThis query in the same app. I posted it twice. Unfortunately I didn't get any responses. I realize that the question might not have been formulated clearly, so let me reformulate it. Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app, taking into account the fact that for the former to work the stop words list needs to be included, and this results in the latter putting stop words among the most important words? Or do these two queries need to use two different indexes, and thus have to be implemented in different applications or in different cores of Solr (with different schema.xml files: one with the StopWord Filter and another without it)? Any opinion will be highly appreciated. Thank you. Regards, Sergey Goldberg P.S. Just for the reference, here is my original message. 1. There are 3 kinds of searches in my application: a) PhraseQuery search; b) search for separate words; c) MLT search. The problem I encountered is in the use of a stop words list. If I don't take it into account, the MLT query picks up common words as the most important words, which is not right. And when I use it, the PhraseQuery stops working. I tried it with the ps and qs parameters (ps=100, qs=100) but that didn't change anything. (Both indexed fields are of type text, the StandardAnalyzer is applied, and all docs are in English.) 2. Do I understand it right that the query q=id:1&mlt=true&mlt.fl=content... should bring back documents where the most important words are in the set of those for the doc with id=1? -- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24303817.html Sent from the Solr - User mailing list archive at Nabble.com.
Creating spellchecker dictionary from multiple sources
Hello everybody, dealing with the spell checker component, I'm wondering if it's possible to generate my dictionary index based on multiple indexes/fields, and I also want to know how anyone has solved this problem. Thx -- Lici
Re: Is there any other way to load the index beside using http connection?
On Wed, 1 Jul 2009 15:07:12 -0700 Francis Yakin fya...@liquid.com wrote: We have several thousands of xml files in the database that we load to the solr master. The Database uses an http connection and transfers those files to the solr master. Solr then translates the xml files into its Lucene index. We are experiencing issues with close/open connections in the firewall, and it is very very slow. Is there any other way to load the data/index from the Database to the solr master besides using an http connection - i.e. we just scp/ftp the xml files from the Database system to the solr master and let solr convert those to lucene indexes? Francis, after reading the whole thread, it seems you have: - Data source: Oracle DB, in a separate location from your SOLR. - Data format: XML output. DIH is definitely a great option, but since you are on 1.2 it is not available to you (you should look into upgrading if you can!). Have you tried connecting to SOLR over HTTP from localhost, thereby avoiding any firewall issues and network latency? It should work a LOT faster than from a remote site. Also make sure not to commit until you really need to. Other alternatives are to transform the XML into csv and import it that way, or write a simple app that will parse the xml and post it directly using the embedded solr method. Plenty of options, all of them documented @ solr's site. good luck, b _ {Beto|Norberto|Numard} Meijome People demand freedom of speech to make up for the freedom of thought which they avoid. Soren Aabye Kierkegaard I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old))
On Thu, 2 Jul 2009 16:12:58 +0800 James liu liuping.ja...@gmail.com wrote: I use solr to search and index is made by lucene. [...] Hi James, make sure the version of Lucene used to create your index is the same as the libraries included in your version of SOLR; then it should work. It may be that an older Lucene index works with the newer Lucene libs provided in Solr, but after using it that way you may not be able to go back - I am not sure of the details. Probably an FAQ by now - check the archives :) good luck, B _ {Beto|Norberto|Numard} Meijome He has no enemies, but is intensely disliked by his friends. Oscar Wilde I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
SergeyG schrieb: Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app taking into account the fact that for the former to work the stop words list needs to be included and this results in the latter putting stop words among the most important words? Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael Ludwig
Adding shards entries in solrconfig.xml
Hi, I read the following article: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr It mentions that it's much easier to set the shards parameter for your SearchHandler in solrconfig.xml. I also went through: http://www.nabble.com/newbie-question-on-SOLR-distributed-searches-with-many-%22shards%22-td20687487.html but it gives only a vague idea about setting the shards, particularly the syntax. Can anyone give an example of setting the shards parameter in solrconfig.xml? Regards, Raakhi
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael, I just saw some of them (words from the stop words list) in the MLT query's response. Sergey
-- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24304705.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old))
Hi, You need to ensure that the index format is compatible (that the same Lucene jars are used in both cases) and that the analysis performed on fields is the same. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James liu liuping.ja...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 4:12:58 AM Subject: Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old)) [...]
Re: 1.4 stable release date
Hi Andrew, I don't think we have a specific date set. The best way to monitor progress is probably by watching the number of JIRA issues set for 1.4 (Fix For). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrew McCombe eupe...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 4:01:21 AM Subject: 1.4 stable release date Hi Just wondering if there is a release date for 1.4 stable? Regards Andrew
Re: IndexMerge not found
Hi, My feeling is those jars are actually not in your CLASSPATH (or in -cp). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James liu liuping.ja...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 2:03:19 AM Subject: Re: IndexMerge not found i use lucene-core-2.9-dev.jar, lucene-misc-2.9-dev.jar [...]
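Otis's diagnosis can be checked directly. The sketch below is my own illustration, not from the thread: it reports whether a class is visible on the current classpath. Run it with the same -cp you pass when launching IndexMergeTool, and remember that on Windows the classpath separator is ';' rather than ':'.

```java
// Sketch: report whether a class is visible on the current classpath.
// The default class name is the one from the error in this thread.
public class CheckClass {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean visible(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String cls = args.length > 0 ? args[0]
                : "org.apache.lucene.misc.IndexMergeTool";
        System.out.println(cls + (visible(cls)
                ? " is on the classpath"
                : " is MISSING - check your -cp entries"));
    }
}
```

For example, running java -cp lucene-core-2.9-dev.jar;lucene-misc-2.9-dev.jar;. CheckClass should report the class as present once the misc jar is really on the classpath.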
Re: Is there any other way to load the index beside using http connection?
Francis, I think both of these are on the Solr Wiki. You'll have to figure out how to export from the DB yourself, and you'll probably write a script/tool to read the export and rewrite it in the csv format. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Francis Yakin fya...@liquid.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 12:26:14 AM Subject: RE: Is there any other way to load the index beside using http connection? How do you import the documents as csv data/file from the Oracle Database to the Solr master (they are two different machines)? And do you have the doc for using EmbeddedSolrServer? Thanks Otis! Francis -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, July 01, 2009 8:01 PM To: solr-user@lucene.apache.org Subject: Re: Is there any other way to load the index beside using http connection? Francis, There are a number of things you can do to make indexing over HTTP faster. You can also import documents as csv data/file. Finally, you can use EmbeddedSolrServer. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Francis Yakin To: solr-user@lucene.apache.org Sent: Wednesday, July 1, 2009 6:07:12 PM Subject: Is there any other way to load the index beside using http connection? [...]
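As a sketch of the script/tool Otis mentions, the snippet below converts a simple exported XML structure into Solr-style CSV (header row first, which Solr's CSV handler reads as the field names). The element and field names here are hypothetical; adjust them to whatever your DB export actually produces, and note that real data would also need CSV escaping of commas and quotes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class XmlToCsv {

    // Convert exported XML (one <doc> per record) into Solr-style CSV.
    // No CSV escaping is done here - a real tool must quote commas/quotes.
    static String convert(String xml, String[] fields) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        StringBuilder sb = new StringBuilder(String.join(",", fields)).append("\n");
        NodeList docs = d.getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);
            for (int j = 0; j < fields.length; j++) {
                if (j > 0) sb.append(",");
                NodeList n = doc.getElementsByTagName(fields[j]);
                sb.append(n.getLength() > 0 ? n.item(0).getTextContent() : "");
            }
            sb.append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(convert(
                "<docs><doc><id>1</id><name>World Atlas</name></doc></docs>",
                new String[]{"id", "name"}));
    }
}
```

The resulting text could then be posted to Solr's CSV update handler, avoiding per-document XML posts over the slow connection.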
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Hi, Rushing quickly through this one: one way you can use the same index for both is by copying fields. One field copy would leave stopwords in (for the PhraseQuery), and the other copy would remove stopwords (for MLT). There may be more elegant ways to accomplish this - this is the first thing that comes to mind. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: SergeyG sgoldb...@mail.ru To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 5:31:21 AM Subject: Implementing PhraseQuery and MoreLikeThis Query in one app [...]
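Otis's copy-field idea might look roughly like this in schema.xml (the field and type names are my own placeholders, not from the thread): one field type keeps stopwords for phrase queries, the other filters them out for MLT.

```xml
<!-- sketch: two copies of the same text, analyzed differently -->
<field name="content"        type="text_keep_stops" indexed="true" stored="true"/>
<field name="content_nostop" type="text_no_stops"   indexed="true" stored="false"/>
<copyField source="content" dest="content_nostop"/>
```

Phrase searches would then target content, while mlt.fl=content_nostop keeps stopwords out of the "interesting terms"; text_no_stops would include solr.StopFilterFactory in its analyzer and text_keep_stops would omit it.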
Re: Creating spellchecker dictionary from multiple sources
Hi Lici, I don't think the current spellchecker can look at more than one field, let alone multiple indices, but you could certainly modify the code and make it do that. Looking at multiple fields of the same index may make more sense than looking at multiple indices. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Licinio Fernández Maurelo licinio.fernan...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 5:36:34 AM Subject: Creating spellchecker dictionary from multiple sources [...]
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Michael - because they are the most frequent, which is how MLT selects terms to use for querying, IIRC. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Ludwig m...@as-guides.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 6:20:05 AM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app [...]
Re: Adding shards entries in solrconfig.xml
Rakhi, Have you looked at the Solr example directories (in Solr svn)? There may be an example of it there. From memory, the syntax is: <shards>URL1,URL2</shards> e.g. <shards>http://shard1:8080/solr,http://shard2:8080/solr</shards> This goes into one of the sections of the request handler configuration. Shards can also be specified in the shards param in the URL itself. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Rakhi Khatwani rkhatw...@gmail.com To: solr-user@lucene.apache.org Cc: ninad.r...@germinait.com Sent: Thursday, July 2, 2009 6:36:43 AM Subject: Adding shards entries in solrconfig.xml [...]
Re: Adding shards entries in solrconfig.xml
Rakhi Khatwani wrote: [...] Can anyone give an example of setting the shards parameter in solrconfig.xml? Regards, Raakhi

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="shards">localhost:8983/solr,localhost:7574/solr</str>
  </lst>
</requestHandler>

That's an example of the syntax, though. Don't do it on the standard request handler or you will create an infinite loop. Define a different handler. -- - Mark http://www.lucidimagination.com
Re: Creating spellchecker dictionary from multiple sources
You could configure multiple spellcheckers on different fields, or if you want to aggregate several fields into the suggestions, use copyField to pool all the text to be suggested into a single field. Erik On Jul 2, 2009, at 7:46 AM, Otis Gospodnetic wrote: Hi Lici, I don't think the current spellchecker can look at more than one field, let alone multiple indices [...]
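Erik's copyField suggestion could look something like this in schema.xml (the field and type names here are placeholders, not from the thread): pool the source fields into one field that the spellchecker builds its dictionary from.

```xml
<!-- sketch: aggregate several fields into a dedicated spell field -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title"  dest="spell"/>
<copyField source="author" dest="spell"/>
```

The spellcheck component's configuration would then point at spell as the field to build its dictionary from.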
Metadata document for faceted search
Hi. I'm trying to implement custom faceted search like CNET's approach: http://www.mail-archive.com/java-u...@lucene.apache.org/msg02646.html But I couldn't figure out how to structure and index the category metadata document. Thanks. -- Osman İZBAT
Re: Creating spellchecker dictionary from multiple sources
Thanks for your responses, guys. My problem is that we currently have 11 cores/indexes, and some of them contain fields I want to use for spell checking. I'm thinking of building an extra core containing the dictionary index and importing the information I need from the multiple indexes via DIH. It should work, I hope. 2009/7/2 Erik Hatcher e...@ehatchersolutions.com You could configure multiple spellcheckers on different fields, or if you want to aggregate several fields into the suggestions, use copyField to pool all text to be suggested together into a single field. [...] -- Lici
Re: Installing a patch in a solr nightly on Windows
Thanks for the suggestions: Koji: I am aware of Cygwin. The problem is I am not sure how to do the whole thing. I downloaded a nightly zip file and extracted it to a directory. Where do I put the .patch file? Where do I execute the patch... command from? It doesn't work when I do it at the root of the install. Michael: I'll take a look at that standalone utility. Paul: I assume that in order to do it with svn, you need to checkout the trunk? What do you do after that? Do you have the link to the distributions? I get OPTIONS of 'http://svn.apache.org/repos/asf/lucene/solr/trunk': could not connect to server (http://svn.apache.org) when I try. Something tells me that my proxy is blocking the connection. If that is the case, then I don't think that I can do a checkout. Do you have any other alternatives? Thanks again for the input. ahammad wrote: Hello, I am trying to install a patch for Solr (https://issues.apache.org/jira/browse/SOLR-284) but I'm not sure how to do it in Windows. I have a copy of the nightly build, but I don't know how to proceed. I looked at the HowToContribute wiki for patch installation instructions, but there are no Windows specific instructions in there. Any help would be greatly appreciated. Thanks -- View this message in context: http://www.nabble.com/Installing-a-patch-in-a-solr-nightly-on-Windows-tp24273921p24306501.html Sent from the Solr - User mailing list archive at Nabble.com.
EnglishPorterFilterFactory and PatternReplaceFilterFactory
In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. Example: 'ä' and 'ae' are expected to give the same search results. To achieve this I added this filter to the text fieldtype definition: <filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all"/> to both index and query analyzers (and more for the other umlauts). This works well when I search for a name (a word not stemmed) but not e.g. with the word Wärme: a search for 'wärme' works, a search for 'waerme' does not work, and a search for 'waerm' works if I move the EnglishPorterFilterFactory after the PatternReplaceFilterFactory. debugQuery for waerme gives a parsedquery of FS:waerm. What I don't understand is why the (existing) records are not found. If I understand it right, there should be 'waerm' in the index as well. By the way, the reason why I keep the EnglishPorterFilterFactory is that the records are in many languages and the English stemming gives good results in many cases, and I don't want (yet) to multiply my fields to have language-specific versions. But even if the stemming is not right because the language is not English, I think records should be found as long as the analyzers are the same for index and query. This is with Solr 1.3. Can someone shed some light on what is going on and how I can achieve my goal? -Michael
Making Analyzer Phrase aware?
I was looking at the SOLR-908 port of nutch CommonGramsFilter as an approach for having phrase searches be sensitive to stop words within a query. So a search on car on street wouldn't match the text car in street. From what I can tell the query version of the filter will *always* create stop-word-grams, not just in a phrase context. I want non-phrase searches to ignore stop words as usual. Can someone tell me how to make an analyzer (or token filter) phrase aware so I only create grams when I know I'm inside of a phrase? Thanks. Mike -- View this message in context: http://www.nabble.com/Making-Analyzer-Phrase-aware--tp24306862p24306862.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Installing a patch in a solr nightly on Windows
ahammad wrote: Thanks for the suggestions: Koji: I am aware of Cygwin. The problem is I am not sure how to do the whole thing. I downloaded a nightly zip file and extracted it to a directory. Where do I put the .patch file? Where do I execute the patch... command from? It doesn't work when I do it at the root of the install. It should work at the root of the install: $ patch -p0 < SOLR-284.patch Do you see an error message? What's the error? Koji
Re: Installing a patch in a solr nightly on Windows
When I go to the source and I input the command, I get: bash: patch: command not found Thanks Koji Sekiguchi-2 wrote: It should work at the root of the install: $ patch -p0 < SOLR-284.patch Do you see an error message? What's the error? Koji [...] -- View this message in context: http://www.nabble.com/Installing-a-patch-in-a-solr-nightly-on-Windows-tp24273921p24307414.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
I think it works better to use the highest tf.idf terms, not the highest tf. That is what I implemented for Ultraseek ten years ago. With tf, you get lots of terms with low discrimination power. wunder On 7/2/09 4:48 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Michael - because they are the most frequent, which is how MLT selects terms to use for querying, IIRC. [...]
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
First, don't use an English stemmer on German text. It will give some odd results. Are you using the same conversions on the index and query side? The German stemmer might already handle typewriter umlauts. If it doesn't, use the pattern replace factory. You will also need to convert ß to ss. You really do need separate fields for each language. Handling these characters is language-specific. The typewriter umlaut conversion is wrong for English. It is correct, but rare, to see a diaresis in English when vowels are pronounced separately, like coöperate. In Swedish, it is not OK to convert ö to another letter or combination of letters. wunder On 7/2/09 6:27 AM, Michael Lackhoff mich...@lackhoff.de wrote: In Germany we have a strange habbit of seeing some sort of equivalence between Umlaut letters and a two letter representation. Example 'ä' and 'ae' are expected to give the same search results. To achieve this I added this filter to the text fieldtype definition: filter class=solr.PatternReplaceFilterFactory pattern=ä replacement=ae replace=all / to both index and query analyzers (and more for the other umlauts). This works well when I search for a name (a word not stemmed) but not e.g. with the word Wärme. search for 'wärme' works search for 'waerme' does not work search for 'waerm' works if I move the EnglishPorterFilterFactory after the PatternReplaceFilterFactory. DebugQuery for waerme gives a parsedquery FS:waerm. What I don't understand is why the (existing) records are not found. If I understand it right, there should be 'waerm' in the index as well. By the way, the reason why I keep the EnglishPorterFilterFactory is that the records are in many languages and the English stemming gives good results in many cases and I don't want (yet) to multiply my fields to have language specific versions. But even if the stemming is not right because the language is not English I think records should be found as long as the analyzers are the same for index and query. 
This is with Solr 1.3. Can someone shed some light on what is going on and how I can achieve my goal? -Michael
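A field type along these lines would apply the umlaut replacements before stemming on both the index and the query side, so the stemmer always sees the two-letter form. This is only a sketch: the field type name is made up, and the set of replacement filters shown is the minimal one for German.

```xml
<!-- Hypothetical schema.xml fragment: the replacements run before the stemmer -->
<fieldType name="text_mixed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ö" replacement="oe" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ü" replacement="ue" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ß" replacement="ss" replace="all"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the identical chain runs at index and query time, 'wärme' and 'waerme' both reach the stemmer as 'waerme' and produce the same indexed term.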
Re: Installing a patch in a solr nightly on Windows
You will need the patch binary as well to apply the diff to the original file.

On Thu, 2009-07-02 at 07:10 -0700, ahammad wrote: When I go to the source and I input the command, I get: bash: patch: command not found Thanks

Koji Sekiguchi-2 wrote: ahammad wrote: Thanks for the suggestions. Koji: I am aware of Cygwin. The problem is I am not sure how to do the whole thing. I downloaded a nightly zip file and extracted it to a directory. Where do I put the .patch file? Where do I execute the patch command from? It doesn't work when I do it at the root of the install. It should work at the root of the install: $ patch -p0 < SOLR-284.patch Do you see an error message? What's the error? Koji
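For reference, here is a minimal, self-contained demonstration of how `patch -p0` applies a unified diff from the directory the paths in the diff are relative to. The file names are made up; on Windows this would run inside a Cygwin shell.

```shell
# Create a file and a modified copy, record the difference, then apply it.
printf 'hello\n' > greeting.txt
printf 'hello world\n' > greeting.new
diff -u greeting.txt greeting.new > demo.patch || true   # diff exits 1 when files differ
rm greeting.new                                          # keep only the original

# -p0 keeps the paths in the diff exactly as written, which is why the
# command must be run from the directory the diff was made against
# (for a Solr patch, the root of the extracted nightly).
patch -p0 < demo.patch
cat greeting.txt   # now contains "hello world"
```

The same mechanics apply to SOLR-284.patch: put it at the root of the extracted nightly and run `patch -p0 < SOLR-284.patch` from there.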
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements. If I use <field column="body" xpath="/book/body/chapter" flatten="true"/> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't get back anything for the body column. So the first example is close, but it only gets the text for the last p element. If I could get all p elements at the same level, that would be what I need. The double slash (/book/body/chapter//p) doesn't seem to be supported. Thanks, -Jay

2009/7/1 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com: Complete XPath is not supported; /book/body/chapter/p should work. If you wish all the text under chapter irrespective of nesting or tag names, use this: <field column="body" xpath="/book/body/chapter" flatten="true"/>

On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill jayallenh...@gmail.com wrote: I'm using the XPathEntityProcessor to parse an xml structure that looks like this:

<book>
  <author>Joe Smith</author>
  <title>World Atlas</title>
  <body>
    <chapter>
      <p>Content I want is here</p>
      <p>More content I want is here.</p>
      <p>Still more content here.</p>
    </chapter>
  </body>
</book>

The author and title parse out fine: <field column="title" xpath="/book/title"/> <field column="author" xpath="/book/author"/> But I can't get at the data inside the p tags. I want to get all non-markup text inside the body tag with something like this: <field column="body" xpath="/book/body/chapter//p"/> but that is not supported. Does anyone know of a way that I can get the content within the p tags without the markup? Thanks, -Jay

-- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Master Slave data distribution | rsync fail issue
Yes. Permissions are the same across cores. ~Vikrant

Bill Au wrote: Are the user/group/permissions on the snapshot files the same for both cases (manual vs postCommit/postOptimize events)? Bill

On Tue, May 5, 2009 at 12:54 PM, tushar kapoor tushar_kapoor...@rediffmail.com wrote: Hi, I am facing an issue while performing snapshot pulling through the snappuller script from the slave server. We have a multicore setup on the master Solr and slave Solr servers. Scenario: 2 cores are set up: i) CORE_WWW.ABCD.COM ii) CORE_WWW.XYZ.COM. The rsync-enable and rsync-start scripts were run from CORE_WWW.ABCD.COM on the master server, so the rsyncd.conf file got generated on CORE_WWW.ABCD.COM only, but not on CORE_WWW.XYZ.COM.

rsyncd.conf of CORE_WWW.ABCD.COM:
uid = webuser
gid = webuser
use chroot = no
list = no
pid file = /opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/logs/rsyncd.pid
log file = /opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/logs/rsyncd.log
[solr]
path = /opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/data
comment = Solr

An rsync error used to get generated while pulling the master server snapshot of the core CORE_WWW.XYZ.COM from the slave end; for the core CORE_WWW.ABCD.COM the snappuller ran without any error. Also, this issue only comes up when snapshots are generated at the master end in the way given below:

A) Snapshots are generated automatically by editing ${SOLR_HOME}/solr/conf/solrconfig.xml to let either a commit or an optimize trigger the snapshooter (search "postCommit" and "postOptimize" to find the configuration section).

Sample solrconfig.xml entry on the master server end: I)
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/bin/snapshooter</str>
  <str name="dir">/opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/bin</str>
  <bool name="wait">true</bool>
  <arr name="args"><str>arg1</str><str>arg2</str></arr>
  <arr name="env"><str>MYVAR=val1</str></arr>
</listener>
The same was done in the CORE_WWW.XYZ.COM solrconfig.xml. II) The dataDir tag remains commented out in both cores' XML on the master server.

Log sample for more clarity. rsyncd.log of the core CORE_WWW.XYZ.COM:
2009/05/01 15:48:40 command: ./rsyncd-start
2009/05/01 15:48:40 [15064] rsyncd version 2.6.3 starting, listening on port 18983
2009/05/01 15:48:40 rsyncd started with data_dir=/opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.XYZ.COm/data and accepting requests
2009/05/01 15:50:36 [15195] rsync on solr/snapshot.20090501153311/ from deltrialmac.mac1.com (10.210.7.191)
2009/05/01 15:50:36 [15195] rsync: link_stat snapshot.20090501153311/. (in solr) failed: No such file or directory (2)
2009/05/01 15:50:36 [15195] rsync error: some files could not be transferred (code 23) at main.c(442)
2009/05/01 15:52:23 [15301] rsync on solr/snapshot.20090501155030/ from delpearsondm.sapient.com (10.210.7.191)
2009/05/01 15:52:23 [15301] wrote 3438 bytes read 290 bytes total size 2779
2009/05/01 16:03:31 [15553] rsync on solr/snapshot.20090501160112/ from deltrialmac.mac1.com (10.210.7.191)
2009/05/01 16:03:31 [15553] rsync: link_stat snapshot.20090501160112/. (in solr) failed: No such file or directory (2)
2009/05/01 16:03:31 [15553] rsync error: some files could not be transferred (code 23) at main.c(442)
2009/05/01 16:04:27 [15674] rsync on solr/snapshot.20090501160054/ from deltrialmac.mac1.com (10.210.7.191)
2009/05/01 16:04:27 [15674] wrote 4173214 bytes read 290 bytes total size 4174633

I'm unable to figure out where the /. gets appended at the end of snapshot.20090501153311/.

snappuller.log:
2009/05/04 16:55:43 started by solrUser
2009/05/04 16:55:43 command: /opt/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.PUFFINBOOKS.CA/bin/snappuller -u webuser
2009/05/04 16:55:52 pulling snapshot snapshot.20090504164935
2009/05/04 16:56:09 rsync failed
2009/05/04 16:56:24 failed (elapsed time: 41 sec)

Error shown on console:
rsync: link_stat snapshot.20090504164935/. (in solr) failed: No such file or directory (2)
client: nothing to do: perhaps you need to specify some filenames or the --recursive option?
rsync error: some files could not be transferred (code 23) at main.c(723)

B) The same issue does not come up when manually running the snapshooter script at regular intervals on the master server and then running the snappuller script at the slave end for multiple cores. The postCommit/postOptimize part of solrconfig.xml has been commented out. Here also the rsync script was run through the core CORE_WWW.ABCD.COM. Snappuller and snapinstaller occurred
Re: Is there any other way to load the index beside using http connection?
LuSql can be found here: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql User manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html LuSql can communicate directly with Oracle and create a Lucene index for you. Of course, as mentioned by other posters, you need to make sure the versions of Lucene and Solr are compatible (use the same jars), that you use the same Analyzers, and that you create the appropriate 'schema' that Solr understands. -glen

2009/7/2 Francis Yakin fya...@liquid.com: Glen, the database we use is Oracle. I am not the database administrator, so I am not familiar with their scripts. So, basically we have an Oracle SQL script to load the XML files over an HTTP connection to our Solr master. My question is: is there any other way, instead of an HTTP connection, to load the XML files to our Solr master? You mentioned LuSql; I am not familiar with that. Can you provide us the docs or something? Again, I am not the database guy, I am only the Solr guy. The database is on a different box than the Solr master, and both are running Linux (RedHat). Thanks, Francis

-----Original Message----- From: Glen Newton [mailto:glen.new...@gmail.com] Sent: Wednesday, July 01, 2009 8:06 PM To: solr-user@lucene.apache.org Subject: Re: Is there any other way to load the index beside using http connection? You can load directly into the backend Lucene index using LuSql[1]. It is faster than Solr, sometimes as much as an order of magnitude faster. Disclosure: I am the author of LuSql. -Glen http://zzzoot.blogspot.com/ [1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

2009/7/1 Francis Yakin fya...@liquid.com: We have several thousands of XML files in the database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master; Solr then translates the XML files into its Lucene index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.
Is there any other way to load the data/index from the database to the Solr master besides an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert those to Lucene indexes? Any input or help will be much appreciated. Thanks, Francis
Re: Is there any other way to load the index beside using http connection?
Are you saying that we have to use LuSql replacing our Solr? To load your data: yes, it is an option. To search your data: no, LuSql is only a loading tool. -glen

2009/7/2 Francis Yakin fya...@liquid.com: Glen, are you saying that we have to use LuSql replacing our Solr? Francis
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
I'm shooting a bit in the dark here, but I'd guess that these are actually understandable results. If you replace then stem, the stemming algorithm works on the exact same word, and you get the results you expect. If you stem then replace, the inputs to the stemmer are different, so the fact that your outputs are different isn't a surprise. That is, your implicit assumption, it seems to me, is that 'wärme' and 'waerme' should go through the stemmer and become 'wärm' and 'waerm', which you can then do the substitution on and produce the same output. I don't think that's a valid assumption. You could probably check the actual contents of your index with Luke and verify whether your assumptions are correct or not. Best, Erick

On Thu, Jul 2, 2009 at 9:27 AM, Michael Lackhoff mich...@lackhoff.de wrote: In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. [...]
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
On 02.07.2009 16:34 Walter Underwood wrote: First, don't use an English stemmer on German text. It will give some odd results. I know, but at the moment I only have the choice between no stemmer at all and one stemmer, and since more than half of the records are English (about 60% English, 30% German, some Italian, French and others) the results are not too bad. Are you using the same conversions on the index and query side? Yes, index and query look exactly the same. That is what I don't understand. I am not complaining about a misbehaving stemmer, unless it already does something odd with the umlauts. The German stemmer might already handle typewriter umlauts. If it doesn't, use the pattern replace factory. You will also need to convert ß to ss. That is what I tried. And yes, I also have a filter for ß to ss. It just doesn't work as expected. You really do need separate fields for each language. Eventually. But now I have to get ready really soon with a small application, and people don't find what they expect. Handling these characters is language-specific. The typewriter umlaut conversion is wrong for English. It is correct, but rare, to see a diaeresis in English when vowels are pronounced separately, like coöperate. In Swedish, it is not OK to convert ö to another letter or combination of letters. It is just for German users, and at the moment it would be totally OK to have coöperate indexed as cooeperate. I know it is wrong and it will be fixed, but given the tight schedule, all I want at the moment is the combination of some stemming (perhaps 70% right or more) and typewriter umlauts (perhaps 90% correct; you gave examples for the missing 10%). Do I have any chance? -Michael
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements.

Hm, I am sure I have done this. In your schema.xml, is the field body multiValued or not?

If I use <field column="body" xpath="/book/body/chapter" flatten="true"/> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't get back anything for the body column. So the first example is close, but it only gets the text for the last p element. If I could get all p elements at the same level, that would be what I need. The double slash (/book/body/chapter//p) doesn't seem to be supported. Thanks, -Jay

2009/7/1 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com: Complete XPath is not supported; /book/body/chapter/p should work. [...]

-- Fergus McMenemie Email: fer...@twig.me.uk Techmore Ltd Phone: (UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
wunder, thank you. (Sorry, I'm not sure if this is your first name.) I thought the MoreLikeThis query normally uses tf.idf of the terms when deciding which terms are the most important (not the most frequent). And if this is not the case, how can I change its behavior?

SergeyG wrote: Hi, Recently I've posted a question regarding using stop words in a PhraseQuery and in a MoreLikeThis query in the same app. I posted it twice. Unfortunately I didn't get any responses. I realize that the question might not have been formulated clearly, so let me reformulate it. Can both queries - PhraseQuery and MoreLikeThis - be implemented in the same app, taking into account the fact that for the former to work the stop words list needs to be included, and this results in the latter putting stop words among the most important words? Or do these two queries need to use two different indexes and thus have to be implemented in different applications, or in different cores of Solr (with different schema.xml files: one with the stop word filter and another without it)? Any opinion will be highly appreciated. Thank you. Regards, Sergey Goldberg

P.S. Just for reference, here is my original message. 1. There are 3 kinds of searches in my application: a) PhraseQuery search; b) search for separate words; c) MLT search. The problem I encountered is in the use of a stop words list. If I don't take it into account, the MLT query picks up common words as the most important words, which is not right. And when I use it, the PhraseQuery stops working. I tried it with the ps and qs parameters (ps=100, qs=100) but that didn't change anything. (Both indexed fields are of type text, the StandardAnalyzer is applied, and all docs are in English.) 2. Do I understand it right that the query q=id:1&mlt=true&mlt.fl=content... should bring back documents where the most important words are in the set of those for the doc with id=1?
-- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24309831.html Sent from the Solr - User mailing list archive at Nabble.com.
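One way out of the either/or, sketched here as an assumption rather than a tested recipe: keep both views of the content in a single schema via copyField, and point MLT at a stop-filtered copy while phrase queries use the unfiltered field. All field and type names below are made up.

```xml
<!-- schema.xml sketch: one core, two views of the same content -->
<!-- "text_with_stops" keeps stop words, so phrase queries still work -->
<field name="content" type="text_with_stops" indexed="true" stored="true"/>
<!-- "text_stopped" applies a StopFilterFactory, so MLT's term selection
     never sees stop words -->
<field name="content_mlt" type="text_stopped" indexed="true" stored="false"/>
<copyField source="content" dest="content_mlt"/>
```

An MLT request would then target the filtered copy, e.g. q=id:1&mlt=true&mlt.fl=content_mlt, while normal and phrase searches keep using content, avoiding the need for two indexes or two cores.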
Re: DIH: Limited xpath syntax unable to parse all xml elements
It is not multiValued. The intention is to get all text under the body element into one body field in the index that is not multiValued - essentially everything within the body element minus the markup. Thanks, -Jay

On Thu, Jul 2, 2009 at 8:55 AM, Fergus McMenemie fer...@twig.me.uk wrote: Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements. Hm, I am sure I have done this. In your schema.xml, is the field body multiValued or not?
-- Fergus McMenemie Email: fer...@twig.me.uk Techmore Ltd Phone: (UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
Also, check out MappingCharFilterFactory in Solr 1.4 and mapping-ISOLatin1Accent.txt in example/solr/conf. -Yonik http://www.lucidimagination.com

On Thu, Jul 2, 2009 at 9:27 AM, Michael Lackhoff mich...@lackhoff.de wrote: In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. Example: 'ä' and 'ae' are expected to give the same search results. [...]
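As a sketch of that suggestion (Solr 1.4 only; the field type and mapping file names are made up, and note that the stock mapping-ISOLatin1Accent.txt folds ä to a, so the German two-letter convention needs its own mapping file):

```xml
<!-- schema.xml: the charFilter rewrites characters before tokenization,
     so every downstream filter and the stemmer see the two-letter form -->
<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-german-umlauts.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

where mapping-german-umlauts.txt would contain lines like:

```
"ä" => "ae"
"ö" => "oe"
"ü" => "ue"
"ß" => "ss"
```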
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
On 02.07.2009 17:28 Erick Erickson wrote: I'm shooting a bit in the dark here, but I'd guess that these are actually understandable results. Perhaps not too much in the dark. That is, your implicit assumption, it seems to me, is that 'wärme' and 'waerme' should go through the stemmer and become 'wärm' and 'waerm', which you can then do the substitution on and produce the same output. I don't think that's a valid assumption. Sounds very reasonable. Will see what I can make out of all this to keep our librarians happy... Yonik Seeley wrote: Also, check out MappingCharFilterFactory in Solr 1.4 and mapping-ISOLatin1Accent.txt in example/solr/conf. Thanks for the hint, looking forward to the 1.4 release ;-) At the moment we are on 1.3 though; I hope to upgrade soon, but probably not soon enough for this app. -Michael
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
You might try a German stemmer. English gets a small benefit from stemming, maybe 5%. German is more heavily inflected than English, so it may get a bigger improvement. German search usually needs word breaking, so that Orgelmusik can be split into Orgel and Musik. To get that, you will probably need a commercial stemmer. wunder

On 7/2/09 8:42 AM, Michael Lackhoff mich...@lackhoff.de wrote: [...]
Re: DIH: Limited xpath syntax unable to parse all xml elements
On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote: It looks like DIH implements its own subset of the Xpath spec. Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples. I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node? It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here. -- Regards, Shalin Shekhar Mangar.
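Putting Shalin's explanation together with Jay's goal, one plausible fix (an untested sketch; the field type is an assumption) is to declare the column multiValued so that every matching <p> is kept rather than only the last one:

```xml
<!-- schema.xml: keep all three <p> values instead of only the last -->
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- data-config.xml: the same XPath, now yielding one value per <p> -->
<field column="body" xpath="/book/body/chapter/p"/>
```

If a single concatenated value is strictly required, the joining would have to happen outside DIH's XPath subset, e.g. in a transformer, since // is not supported.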
Re: multi-word synonyms with multiple matches
: vp,vice president : svp,senior vice president : : However, a search for vp does not return results where the title is : senior vice president. It appears that the term vp is not indexed : when there is a longer string that matches a different synonym. Is this : by design, and is there any way to make solr index all synonyms that : match a term, even if it is contained in a longer synonym? Thanks!

You haven't given us the full details on how you are using the SynonymFilterFactory (expand true or false?), but in general: yes, the SynonymFilter finds the longest match it can. If every svp is also a vp, then being explicit in your synonyms (when doing index-time expansion) should work:

vp,vice president
svp,senior vice president => vp,svp,senior vice president

-Hoss
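For context, a sketch of how that mapping would sit in the configuration (assuming index-time expansion; the surrounding field type is omitted):

```xml
<!-- analyzer fragment: expand="true" applies the synonyms at index time -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
```

with synonyms.txt containing:

```
vp,vice president
svp,senior vice president => vp,svp,senior vice president
```

The explicit => rule keeps a document titled "senior vice president" findable under vp even though the longest-match rule would otherwise only fire the svp mapping.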
Re: Master Slave data distribution | rsync fail issue
You can add the -V option to both your automatic and manual invocations of snappuller and snapinstaller for both cores and compare the debug info. Bill

On Thu, Jul 2, 2009 at 11:02 AM, Vicky_Dev vikrantv_shirbh...@yahoo.co.in wrote: Yes. Permissions are the same across cores. ~Vikrant

Bill Au wrote: Are the user/group/permissions on the snapshot files the same for both cases (manual vs postCommit/postOptimize events)? Bill [...]
RE: Is there any other way to load the index beside using http connection?
Norberto,

Thanks for your input. What do you mean by "Have you tried connecting to SOLR over HTTP from localhost, therefore avoiding any firewall issues and network latency? it should work a LOT faster than from a remote site."?

Here is how our servers are laid out:
1) Database (Oracle) is running on a separate machine
2) Solr master is running on a separate machine by itself
3) 6 Solr slaves (these 6 pull the index from the master using rsync)

We have a SQL (Oracle) script to post the data/index from the Oracle database machine to the Solr master over HTTP. Someone on the Oracle database administration team wrote it. In the Solr master configuration we have a scripts.conf like this:

user=
solr_hostname=localhost
solr_port=7001
rsyncd_port=18983
data_dir=
webapp_name=solr
master_host=localhost
master_data_dir=solr/snapshot
master_status_dir=solr/status

So, basically, from the Oracle system we launch the Oracle/SQL script, posting the data to the Solr master using http://solrmaster/solr/update (inside the SQL script we put this). We cannot use localhost since Solr is not running on the Oracle machine.

Another alternative we are thinking of is to transform the XML into CSV and import/export it. How about LuSql, which some mentioned? Is this a free (open source) application? Do you have any experience with it?

Thanks all for your valuable suggestions!

Francis

-----Original Message-----
From: Norberto Meijome [mailto:numard...@gmail.com]
Sent: Thursday, July 02, 2009 3:01 AM
To: solr-user@lucene.apache.org
Cc: Francis Yakin
Subject: Re: Is there any other way to load the index beside using http connection?

On Wed, 1 Jul 2009 15:07:12 -0700 Francis Yakin fya...@liquid.com wrote:

We have several thousands of XML files in a database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master. Solr then translates the XML files into its index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.
Is there any other way to load the data/index from the database to the Solr master besides using an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert them to Lucene indexes?

Francis, after reading the whole thread, it seems you have:
- Data source: Oracle DB, in a separate location from your Solr.
- Data format: XML output.

DIH is definitely a great option, but since you are on 1.2 it is not available to you (you should look into upgrading if you can!).

Have you tried connecting to Solr over HTTP from localhost, therefore avoiding any firewall issues and network latency? It should work a LOT faster than from a remote site. Also make sure not to commit until you really need to.

Other alternatives are to transform the XML into CSV and import it that way, or write a simple app that will parse the XML and post it directly using the embedded Solr method. Plenty of options, all of them documented @ solr's site.

good luck,
b

_
{Beto|Norberto|Numard} Meijome

"People demand freedom of speech to make up for the freedom of thought which they avoid." Soren Aabye Kierkegaard

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
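In practical terms, the localhost approach suggested above amounts to copying the exported XML onto the master (scp/ftp, which already passes the firewall) and posting it from there. A rough sketch, with made-up paths and the default example port, so adjust for your servlet container:

```shell
# Copy the exported XML files onto the Solr master first, e.g.:
#   scp /exports/docs-*.xml solrmaster:/tmp/solr-load/

# Then, on the master itself, post each file to the local update handler.
for f in /tmp/solr-load/docs-*.xml; do
  curl "http://localhost:8983/solr/update" \
       -H 'Content-Type: text/xml' \
       --data-binary @"$f"
done

# Commit once at the end, rather than per file, as suggested above.
curl "http://localhost:8983/solr/update" \
     -H 'Content-Type: text/xml' \
     --data-binary '<commit/>'
```

The hostname, port, and webapp path here are illustrative; they depend on how Solr is deployed.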
RE: Is there any other way to load the index beside using http connection?
Glen,

Is LuSql free? Is it open source? Does it require a separate machine from the Solr master?

I forgot to tell you that we have a master/slaves Solr environment. The database is Oracle, on a separate machine running in a different network than the Solr master and slaves (there is a firewall between the Oracle machine and the Solr machines). If we have a LuSql machine, do you think it is better to put it on the same network as the database machine or the Solr machines?

Do I need to create a SQL script to get the data from Oracle, load it using LuSql, and convert it to a Lucene index? And how will the Solr master get that data?

Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Thursday, July 02, 2009 8:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

LuSql can be found here: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User Manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

LuSql can communicate directly with Oracle and create a Lucene index for you. Of course - as mentioned by other posters - you need to make sure the versions of Lucene and Solr are compatible (use the same jars), you use the same Analyzers, and you create the appropriate 'schema' that Solr understands.

-glen

2009/7/2 Francis Yakin fya...@liquid.com:

Glen,

The database we use is Oracle. I am not the database administrator, so I am not familiar with their script. So, basically, we have the Oracle SQL script to load the XML files over an HTTP connection to our Solr master. My question is: is there any other way, instead of using an HTTP connection, to load the XML files to our Solr master?

You mentioned LuSql; I am not familiar with that. Can you provide us the docs or something? Again, I am not the database guy, I am only the Solr guy. The database is on a different box than the Solr master, and both are running Linux (RedHat).
Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, July 01, 2009 8:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

You can directly load to the backend Lucene index using LuSql[1]. It is faster than Solr, sometimes as much as an order of magnitude faster. Disclosure: I am the author of LuSql.

-Glen
http://zzzoot.blogspot.com/

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

2009/7/1 Francis Yakin fya...@liquid.com:

We have several thousands of XML files in a database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master. Solr then translates the XML files into its index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.

Is there any other way to load the data/index from the database to the Solr master besides using an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert them to Lucene indexes?

Any input or help will be much appreciated.

Thanks

Francis
Re: DIH: Limited xpath syntax unable to parse all xml elements
Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote:

It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples.

I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node?

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

--
- Mark

http://www.lucidimagination.com
Re: Changing the score of a document based on the value of a field
: The SolrRelevancyFAQ has a heading that's the same as my message's subject:
:
: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-f013f5f2811e3ed28b200f326dd686afa491be5e
:
: There's a TODO on the wiki to provide an actual example. Does anybody happen
: to have an example handy that I could model my query after? Thank you

The types of things that are possible were already pretty clear if you read up on function queries, but I went ahead and added some simple examples.

-Hoss
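For anyone reading this thread later, the kinds of examples in question look roughly like the following, using the standard request handler's _val_ hook (the popularity and date field names are purely illustrative):

```text
# boost matching documents by a numeric field
q=ipod _val_:"popularity"

# favor recent documents: reciprocal of the reverse ordinal of a date field
q=ipod _val_:"recip(rord(date),1,1000,1000)"
```

With the dismax handler, the equivalent is usually expressed through the bf (boost function) parameter instead of embedding _val_ in the query string.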
Re: Is there any other way to load the index beside using http connection?
2009/7/2 Francis Yakin fya...@liquid.com:

Glen, Is LuSql free? Is it open source?

LuSql is an Open Source project.

Does it require a separate machine from the Solr master?

LuSql is a Java application that runs on the command line. It connects to the database using JDBC and creates a local Lucene index, based on the configuration you supply to it.

I forgot to tell you that we have a master/slaves Solr environment. The database is Oracle, on a separate machine running in a different network than the Solr master and slaves (there is a firewall between the Oracle machine and the Solr machines). If we have a LuSql machine, do you think it is better to put it on the same network as the database machine or the Solr machines?

LuSql is heavily multi-threaded, and can suck up the resources of all cores (this is why it runs so fast), so you need to decide whether this is appropriate for your database machine (i.e. if it is a production machine). You can isolate LuSql to specific cores using something like numactl: http://www.linuxmanpages.com/man8/numactl.8.php

Do I need to create a SQL script to get the data from Oracle, load it using LuSql, and convert it to a Lucene index? And how will the Solr master get that data?

LuSql reads from Oracle and writes to a Lucene index. You just need to give LuSql a configuration that has it generate the appropriate index for Solr.

thanks,
Glen
http://zzzoot.blogspot.com/search?q=lucene

Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Thursday, July 02, 2009 8:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

LuSql can be found here: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User Manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

LuSql can communicate directly with Oracle and create a Lucene index for you.
Of course - as mentioned by other posters - you need to make sure the versions of Lucene and Solr are compatible (use the same jars), you use the same Analyzers, and you create the appropriate 'schema' that Solr understands.

-glen

2009/7/2 Francis Yakin fya...@liquid.com:

Glen,

The database we use is Oracle. I am not the database administrator, so I am not familiar with their script. So, basically, we have the Oracle SQL script to load the XML files over an HTTP connection to our Solr master. My question is: is there any other way, instead of using an HTTP connection, to load the XML files to our Solr master? You mentioned LuSql; I am not familiar with that. Can you provide us the docs or something? Again, I am not the database guy, I am only the Solr guy. The database is on a different box than the Solr master, and both are running Linux (RedHat).

Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, July 01, 2009 8:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

You can directly load to the backend Lucene index using LuSql[1]. It is faster than Solr, sometimes as much as an order of magnitude faster. Disclosure: I am the author of LuSql.

-Glen
http://zzzoot.blogspot.com/

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

2009/7/1 Francis Yakin fya...@liquid.com:

We have several thousands of XML files in a database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master. Solr then translates the XML files into its index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.

Is there any other way to load the data/index from the database to the Solr master besides using an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert them to Lucene indexes?

Any input or help will be much appreciated.

Thanks

Francis
Re: DIH: Limited xpath syntax unable to parse all xml elements
On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote:

Shalin Shekhar Mangar wrote:

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes, just like you'd create any multi-valued field. The problem is that his field is not declared to be multi-valued. The same would happen if you posted an XML document to /update with multiple values for a single-valued field.

XPathEntityProcessor provides the flatten=true option if you want to add it as concatenated text. Jay mentioned that flatten did not work for him, which is something we should investigate. Jay, which version of Solr are you running? The flatten option is a 1.4 feature (added with SOLR-1003).

--
Regards,
Shalin Shekhar Mangar.
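To make the two options being discussed concrete, here is a rough sketch of both configurations (field names and XPath values follow the thread's example; treat them as illustrative):

```xml
<!-- Option 1: declare the field multi-valued in schema.xml, so each
     matching node becomes a separate value. -->
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- Option 2 (Solr 1.4, SOLR-1003): keep the field single-valued and let
     XPathEntityProcessor concatenate all matching text into one value.
     This goes inside the entity definition in the DIH data-config: -->
<field column="body" xpath="/book/body/chapter" flatten="true"/>
```

The first snippet belongs in schema.xml; the second inside the DIH data-config.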
Re: DIH: Limited xpath syntax unable to parse all xml elements
Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote:

It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples.

I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node?

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

Only when the field in schema.xml is not multiValued. If the field is multiValued it should still behave as at present? Also... what went wrong with the suggested:

<field column="body" xpath="/book/body/chapter" flatten="true"/>

Regards
Fergus.
Re: DIH: Limited xpath syntax unable to parse all xml elements
Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote:

Shalin Shekhar Mangar wrote:

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes just like you'd create any multi-valued field.

Then shouldn't it throw an error? If your field is not multivalued, but the XML is multivalued, it does seem arbitrary to pick the last node when XPath says to select them all. It seems it should throw an error (saying to use flatten or a multiValued field?) or concatenate all the text.

--
- Mark

http://www.lucidimagination.com
Re: DIH: Limited xpath syntax unable to parse all xml elements
I'm on the trunk, built on July 2: 1.4-dev 789506

Thanks,
-Jay

On Thu, Jul 2, 2009 at 11:33 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote:

Shalin Shekhar Mangar wrote:

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes, just like you'd create any multi-valued field. The problem is that his field is not declared to be multi-valued. The same would happen if you posted an XML document to /update with multiple values for a single-valued field.

XPathEntityProcessor provides the flatten=true option if you want to add it as concatenated text. Jay mentioned that flatten did not work for him, which is something we should investigate. Jay, which version of Solr are you running? The flatten option is a 1.4 feature (added with SOLR-1003).

--
Regards,
Shalin Shekhar Mangar.
Preparing the ground for a real multilang index
As pointed out in the recent thread about stemmers and other language specifics, I should handle each language in its own right. But how?

The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases?

Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer?

Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query.

Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible.

-Michael
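The per-language-field setup described above would look roughly like this in schema.xml (type names, stemmer choices, and the catch-all field are all illustrative):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

<field name="TITLE_ENG" type="text_en" indexed="true" stored="true"/>
<field name="TITLE_GER" type="text_de" indexed="true" stored="true"/>
<!-- unstemmed catch-all field -->
<field name="TITLE" type="text" indexed="true" stored="false"/>

<copyField source="TITLE_*" dest="TITLE"/>
```

One caveat on the copyField idea in the question: copyField copies the raw input before analysis, not the analyzed tokens, so the catch-all TITLE field gets its own (unstemmed) analysis rather than inheriting the per-language stemming.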
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Fergus, setting the field to multivalued did work:

<field column="body" xpath="/book/body/chapter/p" flatten="true"/>

gets all the p elements as multiple values in the body field. The only thing is, the body field is used by some other content sources, so I have to look at the implications that setting it to multi-valued will have on the other data sources. Still, this might do the trick. Thanks to all that helped on this!

-Jay

On Thu, Jul 2, 2009 at 11:40 AM, Fergus McMenemie fer...@twig.me.uk wrote:

Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote:

It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples.

I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node?

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

Only when the field in schema.xml is not multiValued. If the field is multiValued it should still behave as at present? Also... what went wrong with the suggested:

<field column="body" xpath="/book/body/chapter" flatten="true"/>

Regards
Fergus.
Re: Plugin Performance Issues
: I'm not entirely convinced that it's related to our code, but it could be.
: Just trying to get a sense if other plugins have had similar problems, just
: by the nature of using Solr's resource loading from the /lib directory.

Plugins aren't something that every Solr user uses -- but enough people use them that if there was a fundamental memory leak just from loading plugin jars, I'm guessing more people would be complaining. I use plugins in several Solr instances, and I've never noticed any problems like you describe -- but I don't personally use Tomcat.

Otis is right on the money: you need to use profiling tools to really look at the heap and see what's taking up all that RAM.

Alternately: a quick way to rule out the special plugin class loader would be to embed your custom handler directly into the solr.war ("The Old Way" on the SolrPlugins wiki) ... if you still have problems, then the cause isn't the plugin classloader.

-Hoss
Retrieve docs with 1 multivalue field hits
Greetings! I thought I remembered seeing a thread related to retrieving only documents that had more than one hit in a particular multivalue field, but I cannot find it now. Regardless, is this possible in Solr 1.3? Solr 1.4? -- A. Steven Anderson Independent Consultant
Confirming doc change for Wiki for schema / plugins config
There's a particular confusion I've had with the Solr schema and plugins. Though this stuff is obvious to the gurus, looking around I guess I wasn't alone in my confusion. I believe I understand it now and wanted to capture that on the Wiki, but I'm just double checking, and maybe the gurus have some additional comments?

Two Syntaxes AND Two Plugin Sets

There is an abbreviated syntax for specifying plugins in the schema, but there is a more powerful syntax that is preferred. Also, Solr supports both Solr-specific plugins and is compatible with Lucene plugins. These two differences tend to coincide: Solr plugins use the longer, more powerful syntax, whereas Lucene plugins generally must use the abbreviated syntax OR use a custom adapter class.

Two Syntaxes for Defining Field Type Plugins:

Abbreviated Syntax:

<fieldType name="..." class="...">
  <analyzer class="SomeAnalyzer"/> <!-- Do not put additional plugins here -->
</fieldType>

Modern Syntax:

<fieldType name="..." class="...">
  <analyzer>
    <tokenizer class="SomeTokenizer"/>
    <filter class="SomeFilter"/>
    <!-- other filters ... -->
  </analyzer>
</fieldType>

Of course you can have multiple analyzer blocks in the newer syntax, one for index time and one for search. And the filters can have options, etc.

This is confusing because the analyzer tag can EITHER have a class= attribute OR nested subelements, usually of type tokenizer and filter. You should not do both! Further, the main fieldType element also takes a class attribute, which is required, but this is a separate class (...could use some narrative as to why).

Two Common Sources of Plugins:

When looking at schema configurations you find online, it's very important to notice the prefixes in the class names. Classes starting with org.apache.solr.analysis. or the shorthand solr. are Solr specific, and will use the longhand syntax.

Classes starting with org.apache.lucene.analysis. are NOT native Solr plugins and must EITHER use the shorthand syntax (which limits your functionality), or you need to add a custom adapter class.

This is generally a good thing. There are quite a few Lucene plugins out there, and Solr can use any of them out of the box without the need for breaking out a Java compiler. However, when used in this compatibility mode, you give up some functionality. And you can't just use the longer syntax with the Lucene plugins; the advanced syntax isn't directly compatible (at this time). If you want the advantages of the long-form syntax you need to use a Lucene-to-Solr adapter class, often called a factory class.

Examples of Right and Wrong Configurations

Asian-language Solr users will often want to use the CJK processor (CJK = Chinese, Japanese and Korean). They will typically use the base Lucene plugin, but in various configurations.

Examples using CJK Plugins:

<!-- Correct: short form using Lucene-compatible syntax -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<!-- Incorrect: attempt to use long form with Lucene plugins -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  <!-- Wrong: won't be used! -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- ... other filters ... -->
</fieldType>

<!-- Correct: long-form syntax for Lucene plugins THAT HAVE AN ADAPTER -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <!-- This ONLY works if you have an adapter class -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- ... other filters ... -->
  </analyzer>
</fieldType>

There is a nice thread about the adapter class you need.
Later on in the thread the discussion evolves into whether or not to make an uber Lucene class loader, and the performance impact that might have here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg04487.html -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
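For reference, the adapter ("factory") class discussed above is usually only a few lines of Java. A sketch for the CJK tokenizer case, assuming the Solr 1.3-era plugin API (the package name is made up, and compiling this needs the Solr core and Lucene contrib-analyzers jars on the classpath):

```java
package com.example.solr.analysis; // hypothetical package

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * Adapts Lucene's CJKTokenizer so it can be referenced with the long-form
 * <tokenizer class="..."/> syntax in schema.xml.
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    return new CJKTokenizer(input);
  }
}
```

Compile it, drop the jar into Solr's lib directory, and reference it as <tokenizer class="com.example.solr.analysis.CJKTokenizerFactory"/>.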
Re: Confirming doc change for Wiki for schema / plugins config
On Thu, Jul 2, 2009 at 3:53 PM, Mark Bennettmbenn...@ideaeng.com wrote: There is an abbreviated syntax for specifying plugins in the schema, but there is a more powerful syntax that is preferred. I think of it as specifying the Analyzer for a field: one can either specify a Java Analyzer class (opaque, but good for legacy Analyzer implementations or implementations that don't even use Tokenizer/TokenFilter chains), or specify an Analyzer as a Tokenizer followed by a list of Filters. I'm still planning on cleaning up the schema for 1.4 - I'll see if the comments can be made a little clearer. This is confusing because the analyzer tag can EITHER have a class= attribute OR nested subelements, usually of type tokenizer and filter. You should not do both! Futher, the main fieldType element also takes a class attribute, which is required, but this is a separate class (...could use some narrative as to why) For polymorphic behavior for everything that falls outside Analyzer. Classes starting with org.apache.lucene.analysis. are NOT native Solr plugins and must EITHER use the short hand syntax (which limits your functionality), or you need to add a custom adapter class. Yeah, for years I've meant to look into getting this to just work w/o having to create a factory. FYI - the long-form/short-form is just a classloading thing, and doesn't relate to factories. It's only correlated in that something in the solr namespace should have a factory. -Yonik http://www.lucidimagination.com
Re: Preparing the ground for a real multilang index
Michael,

I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend, unless your queries are sufficiently long that their language can be identified. Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/ You'll see how hard it is to guess the language of very short texts. :)

You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word "die" (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language-specific word, say "wunderbar", in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Michael Lackhoff mich...@lackhoff.de
To: solr-user@lucene.apache.org
Sent: Thursday, July 2, 2009 2:58:41 PM
Subject: Preparing the ground for a real multilang index

As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? The first problem is how to know the language.
Then I could just copy all the TITLE_* fields to TITLE and don't bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
I could be wrong about MLT - maybe it really does use TF IDF and not raw frequency. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Walter Underwood wunderw...@netflix.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 10:26:33 AM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app I think it works better to use the highest tf.idf terms, not the highest tf. That is what I implemented for Ultraseek ten years ago. With tf, you get lots of terms with low discrimination power. wunder On 7/2/09 4:48 AM, Otis Gospodnetic wrote: Michael - because they are the most frequent, which is how MLT selects terms to use for querying, IIRC. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Ludwig To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 6:20:05 AM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app SergeyG schrieb: Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app taking into account the fact that for the former to work the stop words list needs to be included and this results in the latter putting stop words among the most important words? Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael Ludwig
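For context, the tf·idf weighting Walter refers to scores a term t in document d, over a collection of N documents, roughly as:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

where tf is the term's frequency in the document and df is the number of documents containing the term. A term that is frequent in the document but rare in the collection gets a high weight, which is exactly the discrimination power that raw tf alone lacks.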
DocSlice andNotSize
Hi,

I have a simple question regarding the DocSlice class. I'm trying to use the (very handy) set operations on DocSlices and I'm rather confused by the way it behaves. I have 2 DocSlices: atlDocs, which, looking at the debugger, holds a docs array of ints of size 1; the second DocSlice is btlDocs, with a docs array of ints of size 67. I know that atlDocs is a subset of btlDocs, so doing btlDocs.andNotSize(atlDocs) should really return 66. But it's returning 10. Any idea what I'm misunderstanding here?

Thanks in advance.

Candide
Re: Preparing the ground for a real multilang index
Not to mention Americans who call themselves wunder. Or brand names, like LaserJet, which are the same in all languages. Queries are far too short for effective language id. You can get language preferences from an HTTP request headers, then allow people to override them. I think the header is Accept-language, but it has been a long time since I did that. I recommend using ISO language codes, en, de, es, fr, and so on, instead of making up your own, like eng and ger. Don't confuse them with ISO country codes: uk, us, etc. Korean and Japanese are easy to mix up with the country codes. wunder On 7/2/09 1:15 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Michael, I think you really aught to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend unless your queries are sufficiently long that their language can be identified. Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/ You'll see how hard it is to guess a language of very short texts. :) You really want to avoid that huge OR. Often it makes no sense to OR in multilingual context. Think about the word die (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language specific word, say wunderbar, in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar Otis-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Lackhoff mich...@lackhoff.de To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 2:58:41 PM Subject: Preparing the ground for a real multilang index As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? The first problem is how to know the language. 
Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases? Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer? Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
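For illustration only (not from the thread), a schema.xml fragment along the lines being described might look like this; the type names and analyzer chains below are assumptions, not Michael's actual schema. Note that copyField copies the raw input text, not the analyzed tokens, so the catch-all TITLE field gets its own language-neutral analysis:

```xml
<!-- sketch: per-language stemmed fields plus an unstemmed catch-all -->
<types>
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_de" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_plain" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <field name="TITLE_ENG" type="text_en" indexed="true" stored="true"/>
  <field name="TITLE_GER" type="text_de" indexed="true" stored="true"/>
  <field name="TITLE" type="text_plain" indexed="true" stored="true"/>
</fields>
<!-- copyField feeds TITLE the raw source text; TITLE then applies
     its own (unstemmed) analysis chain -->
<copyField source="TITLE_*" dest="TITLE"/>
```

With a setup like this, a query against TITLE matches unstemmed terms across all languages, while per-language stemmed matching still requires querying the TITLE_* fields.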
Re: Deleting from SolrQueryResponse
: Hi, I was wondering if anyone has had luck deleting added documents from : SolrQueryResponse? I am subclassing StandardRequestHandler and after I run : the handle request body method (super.handleRequestBody(req, rsp);) I want : to filter out some of the hits. DocLists are immutable (if i remember correctly) but your handler can always remove the DocList from the SolrQueryResponse and then replace it with a new one after you've made your changes. one thing to keep in mind however is that post-processing a DocList to filter stuff out is almost never a good idea -- things get really convoluted when you think about dealing with pagination, and except for some really trivial use cases you can never know what your upper bound should be when deciding how many hits to request from the underlying IndexSearcher. You're usually better off restructuring your problem so that you can construct a Query/Filter/DocSet that you want to filter by first and *then* executing the search to generate the DocList in a single pass. PS: replying to your own message (or reposting it) to bump it up generally doesn't encourage replies any faster -- it just increases the volume of traffic on the list, and if anything antagonizes people and makes them less interested in responding. -Hoss
Re: complex OR query not working
: I want to execute the following query: : (spacegroupID:g*) OR (!userID:g*). First: ! is not a negation operator in the lucene/solr query parser : In above syntax (!userID:g*) gives results correctly. ...i don't think it's doing what you think it's doing. second: boolean queries can't be purely negative. they need to select something. the second clause of your main query is a boolean query with a single negative clause. try this instead... spacegroupID:g* (*:* -userID:g*) ...that will match any doc with a spacegroupId starting with the g character OR: any doc, except those with userID starting with the g character -Hoss
Re: Building Solr index with Lucene
On Wed, Jul 1, 2009 at 6:49 PM, Ben Bangert b...@groovie.org wrote: For performance reasons, we're attempting to build the index used with Solr Solr 1.4 has a binary communications format, and a StreamingUpdateSolrServer that massively improves indexing performance. You may want to revisit the decision to bypass Solr, especially as more indexing functionality emerges (update processors, etc). There is also EmbeddedSolrServer if you want something in-process. -Yonik http://www.lucidimagination.com
Re: Excluding characters from a wildcard query
: I'm not sure if you can do prefix queries with the fq parameter. You will : need to use the 'q' parameter for that. fq supports anything q supports ... with the QParser and local params options it can be any syntax you want (as long as there is a QParser for it) -Hoss
Re: Solr spring application context error
: I did try that. The problem is that you can't tell : FileSystemXmlApplicationContext to load with a different ClassLoader. why not? it subclasses DefaultResourceLoader which has the setClassLoader method Mark pointed out. -Hoss
Re: DocSlice andNotSize
On Thu, Jul 2, 2009 at 4:24 PM, Candide Kemmler cand...@palacehotel.org wrote: I have a simple question about the DocSlice class. I'm trying to use the (very handy) set operations on DocSlices and I'm rather confused by the way it behaves. I have 2 DocSlices: atlDocs, which, by looking at the debugger, holds a docs array of ints of size 1; the second DocSlice is btlDocs, with a docs array of ints of size 67. I know that atlDocs is a subset of btlDocs, so btlDocs.andNotSize(atlDocs) should really return 66. But it's returning 10. The short answer is that all of the set operations were only designed for DocSets (as opposed to DocLists). Yes, perhaps DocList should not have extended DocSet... -Yonik http://www.lucidimagination.com
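To make the DocSet-vs-DocList distinction concrete, here is a self-contained sketch (plain java.util, no Solr classes) of the set semantics andNotSize is meant to have when both operands are true sets:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AndNotSizeDemo {
    // Set-difference cardinality |b \ a| -- what andNotSize computes
    // when both operands are genuine sets of doc ids.
    public static int andNotSize(Set<Integer> b, Set<Integer> a) {
        Set<Integer> diff = new HashSet<>(b);
        diff.removeAll(a);
        return diff.size();
    }

    public static void main(String[] args) {
        Set<Integer> btl = new HashSet<>();
        for (int i = 0; i < 67; i++) btl.add(i);            // 67 docs
        Set<Integer> atl = new HashSet<>(Arrays.asList(5)); // 1 doc, a subset of btl
        System.out.println(andNotSize(btl, atl));           // prints 66
    }
}
```

A DocSlice, by contrast, is an ordered window over ranked results, so running the DocSet operations on it gives the surprising numbers seen above.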
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Otis, Your recipe does work: after copying an indexing field and excluding stop words, the MoreLikeThis query started fetching meaningful results. :) Just one issue remained. When I execute the query in this way:

String query = "q=id:1&mlt.fl=content...&fl=title+author+score";
HttpClient client = new HttpClient();
GetMethod get = new GetMethod("http://localhost:8080/solr/mlt");
get.setQueryString(query);
client.executeMethod(get);

... it works fine, bringing the results as an XML string. But when I use the SolrJ approach:

String query = "id:1";
solrQuery.setQuery(query);
solrQuery.setParam("mlt", true);
solrQuery.setParam("mlt.fl", "content");
solrQuery.setParam("fl", "title author score");
QueryResponse queryResponse = server.query(solrQuery);

the result contains only one doc with id=1 and no other more-like docs. In my solrconfig.xml, I have these settings:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"> ...
<requestHandler name="standard" class="solr.SearchHandler" default="true"> ...

I guess it all is a matter of syntax but I can't figure out what's wrong. Thank you very much (and again, thanks to Michael and Walter). Cheers, Sergey Michael Ludwig-4 wrote: SergeyG wrote: Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app, taking into account the fact that for the former to work the stop words list needs to be included, and this results in the latter putting stop words among the most important words? Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael Ludwig -- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24314840.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Preparing the ground for a real multilang index
I believe the proper way is for the server to compute a list of accepted languages in order of preference: the web-platform language (e.g. the user setting), and the values in the Accept-Language HTTP header (which are from the browser or platform). Then you expand your query for surfing waves (say) to:
- phrase query: surfing waves exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - english: surf wav ^1.0
  - german: surfing wave ^0.9
- then maybe even try the phonetic analyzer (matched in a separate field probably)
I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual. paul On 2 Jul 2009, at 22:15, Otis Gospodnetic wrote: Michael, I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend unless your queries are sufficiently long that their language can be identified. Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/ You'll see how hard it is to guess the language of very short texts. :) You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word die (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language-specific word, say wunderbar, in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Lackhoff mich...@lackhoff.de To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 2:58:41 PM Subject: Preparing the ground for a real multilang index As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? 
The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases? Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer? Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
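Paul's first step, computing an ordered language list from the Accept-Language header, can be sketched in plain Java. This is a self-contained illustration (q-value parsing only; real header parsing has more edge cases such as wildcards and malformed input):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AcceptLanguage {
    // Parse e.g. "de-DE,de;q=0.9,en;q=0.7" into ["de", "en"]:
    // primary subtags, ordered by descending q-value, deduplicated.
    public static List<String> preferredLanguages(String header) {
        List<String[]> entries = new ArrayList<>();
        for (String part : header.split(",")) {
            String[] bits = part.trim().split(";");
            // keep only the primary subtag ("de-DE" -> "de")
            String lang = bits[0].trim().split("-")[0].toLowerCase();
            double q = 1.0; // default quality per HTTP content negotiation
            for (int i = 1; i < bits.length; i++) {
                String b = bits[i].trim();
                if (b.startsWith("q=")) q = Double.parseDouble(b.substring(2));
            }
            entries.add(new String[] { lang, Double.toString(q) });
        }
        // stable sort by descending q-value
        entries.sort(Comparator.comparingDouble(
                (String[] e) -> Double.parseDouble(e[1])).reversed());
        List<String> langs = new ArrayList<>();
        for (String[] e : entries)
            if (!langs.contains(e[0])) langs.add(e[0]);
        return langs;
    }

    public static void main(String[] args) {
        System.out.println(preferredLanguages("de-DE,de;q=0.9,en;q=0.7")); // [de, en]
    }
}
```

The resulting list could then drive which stemmed TITLE_* fields to query first, per Paul's boosting scheme.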
Re: Deleting from SolrQueryResponse
hossman wrote: one thing to keep in mind however is that post-processing a DocList to filter stuff out is almost never a good idea -- things get really convoluted when you think about dealing with pagination and except for some really trivial use cases you can never know what your upper bound should be when deciding how many hits to request from underlying IndexSearcher. You're usually better off restructuring your problem so that you can construct a Query/Filter/DocSet that you want to filter by first and *then* executing the search to generate the DocList in a single pass. I want to edit the DocList in a custom SearchComponent to be executed after the QueryComponent. I do not require faceting, etc. If I do not want faceted results, will I still need to take any special steps not to break the DocList? hossman wrote: PS: replying to your own message (or reposting it) to bump it up generally doesn't encourage replies any faster -- it just increases the volume of traffic on the list, and if anything antagonizes people and makes them less interested in responding. Okay, sorry, I wasn't certain of the protocol on that. Thanks, Brett. -- View this message in context: http://www.nabble.com/Deleting-from-SolrQueryResponse-tp24266686p24315607.html Sent from the Solr - User mailing list archive at Nabble.com.
reindexed data on master not replicated to slave
Hi, When the index data was corrupted on the master instance, I wanted to wipe out all the index data and re-index everything. I was hoping the newly created index data would be replicated to the slaves, but it wasn't. Here are the steps I performed: 1. stop master 2. delete the directory 'index' 3. start master 4. disable replication on master 5. index all data from scratch 6. enable replication on master It seemed from the log file that the slave instances discovered that a new index was available, claimed that the new index was installed, and then tried to update the index properties; but looking into the index directory on the slaves, you will find that no index data files were updated or added, plus the slaves keep trying to fetch the new index. Here are some lines from the slave's log file: Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Starting replication process Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Number of files in latest snapshot in master: 69 Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Total time taken for download : 0 secs Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Conf files are not downloaded or are in sync Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller modifyIndexProps INFO: New index installed. Updating index properties... 
Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Master's version: 1246488421310, generation: 9 Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Slave's version: 1246385166228, generation: 56 Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Starting replication process Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Number of files in latest snapshot in master: 69 Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Total time taken for download : 0 secs Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Conf files are not downloaded or are in sync Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller modifyIndexProps INFO: New index installed. Updating index properties... Is this process incorrect, or is it a bug? If the process is incorrect, what is the right process? Thanks, J
Re: reindexed data on master not replicated to slave
Jay, You didn't mention which version of Solr you are using. It looks like some trunk or nightly version. Maybe you can try the latest nightly? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: solr jay solr...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 9:14:48 PM Subject: reindexed data on master not replicated to slave
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Sergey, Glad to hear the suggestion worked! I can't spot the problem (though I think you want to use a comma to separate the list of fields in the fl parameter value). I suggest you look at the servlet container logs and Solr logs and compare the requests that these two calls make. Once you see how the second one differs from the first, you will probably be able to figure out how to adjust the second one to produce the same results as the first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: SergeyG sgoldb...@mail.ru To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 6:17:59 PM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app
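One difference worth checking (an assumption based on the two snippets, not confirmed in the thread): the raw HttpClient call targets the /mlt handler, while the SolrJ call goes to the default standard handler, where any more-like-this documents would land in the response's moreLikeThis section rather than in the main result list. Pointing SolrJ at the same handler (e.g. solrQuery.setQueryType("/mlt")), together with registrations along these lines in solrconfig.xml, would make the two requests comparable; the defaults shown here are illustrative:

```xml
<!-- sketch of the handler setup referenced in the post -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">content</str>
    <str name="mlt.mintf">1</str>
  </lst>
</requestHandler>
<requestHandler name="standard" class="solr.SearchHandler" default="true"/>
```

With the /mlt handler, the similar documents come back as the main result list, matching what the direct HTTP call returns.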
Re: Solr slave Heap space error and index size issue
I don't think the index should *suddenly* increase in size if you are just adding/updating/deleting documents. It is normal that it temporarily increases during optimization. 35GB for 1.5M docs sounds like a lot. You either have large fields, or you store them, or both? Maybe share your schema, show relevant solrconfig settings, list your index directory, share some of the stats from the Solr admin stats page, tell us about your JVM parameters, your RAM, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Vikash Kontia vikash.kon...@gmail.com To: solr-user@lucene.apache.org Sent: Sunday, June 28, 2009 10:46:57 PM Subject: Solr slave Heap space error and index size issue Hi All, I have 1 master and 1 slave machine at deployment. I am using a Solr 1.4 nightly build. My fresh index size is 35GB for 1.5 million documents, with approx 50 fields per document. I have taken care of omitNorms and stored fields in the schema. I have approx 1 update daily and I run a commit every hour. 5-6 days after a fresh index, the index size suddenly increased (no optimization in between) by 150GB, and then queries take a long time and a Java heap error comes. I ran optimize on this index. It takes a long time, increases the index size to more than 200GB, and never reports that the optimize completed. The merge factor is the default as given in the Solr build. To work around this issue I have to re-index almost every week. I think the issue is with frequent updates on the index. Please help me debug the issue, or am I missing something in the configuration? Thanks Vikash Kontia -- View this message in context: http://www.nabble.com/Solr-slave-Heap-space-error-and-index-size-issue-tp24247690p24247690.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: reindexed data on master not replicated to slave
it's the nightly build of May 10. I'll try the latest. Thanks, J On Thu, Jul 2, 2009 at 8:09 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Jay, You didn't mention which version of Solr you are using. It looks like some trunk or nightly version. Maybe you can try the latest nightly? Otis
Re: reindexed data on master not replicated to slave
Jay, I see "Updating index properties..." twice. This should happen rarely; in your case it should have happened only once, because you cleaned up the master only once. On Fri, Jul 3, 2009 at 6:09 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Jay, You didn't mention which version of Solr you are using. It looks like some trunk or nightly version. Maybe you can try the latest nightly? Otis -- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Preparing the ground for a real multilang index
On 03.07.2009 00:49 Paul Libbrecht wrote: [I'll try to address the other responses as well] I believe the proper way is for the server to compute a list of accepted languages in order of preferences. The web-platform language (e.g. the user-setting), and the values in the Accept-Language http header (which are from the browser or platform). All this is not going to help much because the main application is a scientific search portal for books and articles with many users searching cross-language. The most typical use case is a German user searching multilingually. So we might even get a multilingual search, e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for Accept headers or a language select field (it would be left on "any" in most cases). Other popular use cases are citations (in whatever language) cut and pasted into the search field. Then you expand your query for surfing waves (say) to: - phrase query: surfing waves exactly (^2.0) - two terms, no stemming: surfing waves (^1.5) - iterate through the languages and query for stemmed variants: - english: surf wav ^1.0 - german: surfing wave ^0.9 - then maybe even try the phonetic analyzer (matched in a separate field probably) This is an even more sophisticated variant of the multiple OR I came up with. Oh well... I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual. Indeed, and often users are not even aware of it; especially in a scientific context they use their native tongue and English almost interchangeably -- and they expect the search engine to cope with it. I think the best would be to process the data according to its language but not make any assumptions about the query language, and I am totally lost how to get a clever schema.xml out of all this. Thanks everyone for listening and I am still open for good suggestions to deal with this problem! -Michael