Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-21 Thread Robert Gründler

On 20.04.11 18:51, Robert Muir wrote:

Hi, there is a proposed patch uploaded to the issue. Maybe you can
help by reviewing/testing it?


if i succeed in compiling solr, i can test the patch. Is this the right starting point for such an endeavour? http://wiki.apache.org/solr/HackingSolr



-robert


2011/4/20 Robert Gründler rob...@dubture.com:

Hi all,

i'm getting the following exception when using highlighting for a field
containing HTMLStripCharFilterFactory:

org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ...
exceeds length of provided text sized 21

It seems this is a known issue:

https://issues.apache.org/jira/browse/LUCENE-2208

Does anyone know if there's a fix implemented yet in solr?


thanks!


-robert








Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Robert Gründler

we're indexing around 10M records from a mysql database into
a single solr core.

The DataImportHandler needs to join 3 sub-entities to denormalize
the data.

We've run into some troubles for the first 2 attempts, but setting
batchSize=-1 for the dataSource resolved the issues.
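For reference, a minimal sketch of the dataSource element with streaming enabled (driver, url and credentials below are placeholders); batchSize="-1" makes the MySQL JDBC driver stream rows instead of buffering the whole result set:

  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"
              batchSize="-1" />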

Do you need a lot of complex joins to import the data from mysql?



-robert




On 4/21/11 8:08 PM, Scott Bigelow wrote:

I've been using Solr for a while now, indexing 2-4 million records
using the DIH to pull data from MySQL, which has been working great.
For a new project, I need to index about 20M records (30 fields), and I
have been running into issues with MySQL disconnects, right around
15M. I've tried several remedies I've found on blogs, changing
autoCommit, batchSize etc., but none of them seems to have resolved
the issue. It got me wondering: Is this the way everyone does
it? What about 100M records up to 1B; are those all pulled using DIH
and a single query?

I've used Sphinx in the past, which uses multiple queries to pull out
subsets of records ranged on the primary key; does Solr offer similar
functionality? It seems that once a Solr index gets to a certain size,
indexing a batch takes longer than MySQL's net_write_timeout, so the
connection gets killed.

Thanks for your help, I really enjoy using Solr and I look forward to
indexing even more data!




HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-20 Thread Robert Gründler

Hi all,

i'm getting the following exception when using highlighting for a field 
containing HTMLStripCharFilterFactory:


org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 
... exceeds length of provided text sized 21


It seems this is a known issue:

https://issues.apache.org/jira/browse/LUCENE-2208

Does anyone know if there's a fix implemented yet in solr?


thanks!


-robert





DataImportHandlerDeltaQueryViaFullImport and delete query

2011-04-18 Thread Robert Gründler

Hi,

when using http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport to periodically run a delta-import, is it necessary to run a separate normal delta-import after it to delete entries from the index (using deletedPkQuery)?

If so, what's the point of using this method for running delta-imports? If not, how can i delete specific entries with this delta-import method?
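For reference, the entity pattern from that wiki page looks roughly like this (sketch only; table and column names may differ): the full-import query doubles as the delta query by filtering on last_index_time, and the import is triggered with command=full-import&clean=false.

  <entity name="item" pk="id"
          query="select * from item
                 where '${dataimporter.request.clean}' != 'false'
                    or last_modified > '${dataimporter.last_index_time}'">
    ...
  </entity>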

regards


-robert





Re: DataImportHandlerDeltaQueryViaFullImport and delete query

2011-04-18 Thread Robert Gründler

On 18.04.11 09:23, Bill Bell wrote:

It runs delta imports faster. Normally you need to get the PKs that
changed and then run them through query=, which is slow when you have a
lot of IDs. But the query= only adds/updates entries; I'm not sure how to delete entries
by running a query like select ... from ... where deleted = 1.

as far as i understand, there's postImportDeleteQuery and deletedPkQuery to achieve this.

According to the wiki, deletedPkQuery is only used by delta-imports, and postImportDeleteQuery is used after a full import.

From my understanding, using dataimport?command=full-import&clean=false matches neither of the two, or am i wrong with that?

thanks,

-robert


DisMaxQueryParser: Unknown function min in FunctionQuery

2011-03-29 Thread Robert Gründler

Hi all,

i'm trying to implement a FunctionQuery using the bf parameter of the 
DisMaxQueryParser, however, i'm getting an exception:


Unknown function min in FunctionQuery('min(1,2)', pos=4)

The request that causes the error looks like this:

http://localhost:2345/solr/main/select?qt=dismax&qf=name^0.1&qf=name_exact^10.0&debugQuery=true&bf=min(1,2)&version=1.2&wt=json&json.nl=map&q=+foo&start=0&rows=3


I'm not sure where the pos=4 part of the FunctionQuery is coming from.

My Solr version is 1.4.1.

Has anyone a hint why i'm getting this error?


thanks!

-robert




Conditional Scoring (was: Re: DisMaxQueryParser: Unknown function min in FunctionQuery)

2011-03-29 Thread Robert Gründler

sorry, didn't see that.


So, as the relevance functions are also only available in solr >= 4.0 
(http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions), i'm not
sure if i can solve our requirement in one query (i thought i could use 
a function query for this).


Here's our Problem:

We have 3 Fields:

1. exact_match ( text )
2. fuzzy_match ( text )
3. popularity ( integer )

Our requirement looks as follows:

All results which have a match in exact_match MUST score higher than 
results without a match in exact_match, regardless of the value in the 
popularity field. All results which have no match in exact_match 
should use the popularity field for scoring.


Is this possible without using a function query?
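For what it's worth, a common dismax approximation (not discussed further in this thread) is a large boost query on the exact field plus a boost function on popularity; field names and boosts below are placeholders taken from the requirement, and this only biases the score rather than strictly guaranteeing that every exact match outranks every non-exact one:

  qt=dismax&q=foo&qf=fuzzy_match&bq=exact_match:"foo"^10000&bf=popularity^0.5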


thanks.


-robert





On 29.03.11 16:34, Erik Hatcher wrote:

On Mar 29, 2011, at 10:01 , Robert Gründler wrote:


Hi all,

i'm trying to implement a FunctionQuery using the bf parameter of the 
DisMaxQueryParser, however, i'm getting an exception:

Unknown function min in FunctionQuery('min(1,2)', pos=4)

The request that causes the error looks like this:

http://localhost:2345/solr/main/select?qt=dismax&qf=name^0.1&qf=name_exact^10.0&debugQuery=true&bf=min(1,2)&version=1.2&wt=json&json.nl=map&q=+foo&start=0&rows=3


I'm not sure where the pos=4 part of the FunctionQuery is coming from.

My Solr version is 1.4.1.

Has anyone a hint why i'm getting this error?

 From http://wiki.apache.org/solr/FunctionQuery#min - min() is 3.2 (though I 
think that really means 3.1 now, right??).  Definitely not in 1.4.1.

Erik





MySQL queries high when using delta-import

2011-03-14 Thread Robert Gründler

Hi,

we have 3 solr cores, each of them is running a delta-import every 2 
minutes on

a MySQL database.

We've noticed a significant increase in MySQL queries per second since we started the delta updates.

Before that, the database server received between 50 and 100 queries per second; since the delta-imports, the query count has risen to 100 to 200 queries per second.

I've temporarily disabled the delta imports for 2 hours, and the queries per second immediately decreased again to 50-100.

I followed the wiki entry which only uses one query for the delta import:

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

I did not expect the number of queries per second to the database to increase that much, so i'm wondering if others have experienced similar issues.


cheers

-robert




Dataimport performance

2010-12-15 Thread Robert Gründler
Hi,

we're looking for some comparison-benchmarks for importing large tables from a 
mysql database (full import).

Currently, a full-import of ~ 8 Million rows from a MySQL database takes around 
3 hours, on a QuadCore Machine with 16 GB of
ram and a Raid 10 storage setup. Solr is running on an apache tomcat instance, where it is the only app. The tomcat instance has the following memory-related java_opts:

-Xms4096M -Xmx5120M


The data-config.xml looks like this (only 1 entity):

  <entity name="track" transformer="TemplateTransformer"
          query="select t.id as id, t.title as title, l.title as label
                 from track t left join label l on (l.id = t.label_id)
                 where t.deleted = 0">
    <field column="title" name="title_t" />
    <field column="label" name="label_t" />
    <field column="id" name="sf_meta_id" />
    <field column="metaclass" template="Track" name="sf_meta_class" />
    <field column="metaid" template="${track.id}" name="sf_meta_id" />
    <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id" />

    <entity name="artists"
            query="select a.name as artist from artist a
                   left join track_artist ta on (ta.artist_id = a.id)
                   where ta.track_id=${track.id}">
      <field column="artist" name="artists_t" />
    </entity>

  </entity>


We have the feeling that 3 hours for this import is quite long, given the performance of the server running solr/mysql.

Are we wrong with that assumption, or do people experience similar import times with this amount of data?


thanks!


-robert





Re: Dataimport performance

2010-12-15 Thread Robert Gründler
 What version of Solr are you using?


Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55


-robert



 
 Adam
 
 2010/12/15 Robert Gründler rob...@dubture.com
 
 Hi,
 
 we're looking for some comparison-benchmarks for importing large tables
 from a mysql database (full import).
 
 Currently, a full-import of ~ 8 Million rows from a MySQL database takes
 around 3 hours, on a QuadCore Machine with 16 GB of
 ram and a Raid 10 storage setup. Solr is running on a apache tomcat
 instance, where it is the only app. The tomcat instance
 has the following memory-related java_opts:
 
 -Xms4096M -Xmx5120M
 
 
 The data-config.xml looks like this (only 1 entity):
 
 entity name=track query=select t.id as id, t.title as title,
 l.title as label from track t left join label l on (l.id = t.label_id)
 where t.deleted = 0 transformer=TemplateTransformer
   field column=title name=title_t /
   field column=label name=label_t /
   field column=id name=sf_meta_id /
   field column=metaclass template=Track name=sf_meta_class/
   field column=metaid template=${track.id} name=sf_meta_id/
   field column=uniqueid template=Track_${track.id}
 name=sf_unique_id/
 
   entity name=artists query=select a.name as artist from artist a
 left join track_artist ta on (ta.artist_id = a.id) where ta.track_id=${
 track.id}
 field column=artist name=artists_t /
   /entity
 
 /entity
 
 
 We have the feeling that 3 hours for this import is quite long - regarding
 the performance of the server running solr/mysql.
 
 Are we wrong with that assumption, or do people experience similar import
 times with this amount of data to be imported?
 
 
 thanks!
 
 
 -robert
 
 
 
 



Re: Dataimport performance

2010-12-15 Thread Robert Gründler
i've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query:

Without subquery: 500k in 3 min 30 sec
Without join and without subquery: 500k in 2 min 30 sec
With subquery and with left join: 320k in 6 min 30 sec

so the joins / subqueries are definitely a bottleneck.

How exactly did you implement the custom data import?

In our case, we need to de-normalize the relations of the sql data for the index, so i fear i can't really get rid of the join / subquery.
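One alternative worth noting (not discussed in this thread): the artist sub-entity can sometimes be folded into the parent query with MySQL's GROUP_CONCAT and split back into a multivalued field with the RegexTransformer, which avoids running one extra query per track. A rough sketch, reusing the names from the config above; the other fields are elided:

  <entity name="track" transformer="TemplateTransformer,RegexTransformer"
          query="select t.id as id, t.title as title,
                        group_concat(a.name separator '|') as artists
                 from track t
                 left join track_artist ta on (ta.track_id = t.id)
                 left join artist a on (a.id = ta.artist_id)
                 where t.deleted = 0
                 group by t.id, t.title">
    <field column="artists" name="artists_t" splitBy="\|" />
    ...
  </entity>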


-robert





On Dec 15, 2010, at 15:43 , Tim Heckman wrote:

 2010/12/15 Robert Gründler rob...@dubture.com:
 The data-config.xml looks like this (only 1 entity):
 
  <entity name="track" transformer="TemplateTransformer"
          query="select t.id as id, t.title as title, l.title as label
                 from track t left join label l on (l.id = t.label_id)
                 where t.deleted = 0">
    <field column="title" name="title_t" />
    <field column="label" name="label_t" />
    <field column="id" name="sf_meta_id" />
    <field column="metaclass" template="Track" name="sf_meta_class" />
    <field column="metaid" template="${track.id}" name="sf_meta_id" />
    <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id" />

    <entity name="artists"
            query="select a.name as artist from artist a
                   left join track_artist ta on (ta.artist_id = a.id)
                   where ta.track_id=${track.id}">
      <field column="artist" name="artists_t" />
    </entity>

  </entity>
 
 So there's one track entity with an artist sub-entity. My (admittedly
 rather limited) experience has been that sub-entities, where you have
 to run a separate query for every row in the parent entity, really
 slow down data import. For my own purposes, I wrote a custom data
 import using SolrJ to improve the performance (from 3 hours to 10
 minutes).
 
 Just as a test, how long does it take if you comment out the artists entity?



Copying the index from one solr instance to another

2010-12-15 Thread Robert Gründler
Hi again,

let's say you have 2 solr Instances, which have both exactly the same 
configuration (schema, solrconfig, etc).

Could it cause any troubles if we import an index from a SQL database on solr instance A, and copy the whole index to the datadir of solr instance B (both solr instances run on different servers)?

As far as i can tell, this should work and solr instance B should have the 
exact same index as solr instance A after the copy-process.

Do we miss something, or is this workflow safe to go with?

-robert

Re: Copying the index from one solr instance to another

2010-12-15 Thread Robert Gründler
thanks for your feedback. we can shut down both solr servers for the duration of the copy process, and both solr instances run the same version, so we should be ok.

i'll let you know if we encounter any troubles.


-robert



On Dec 15, 2010, at 18:11 , Shawn Heisey wrote:

 On 12/15/2010 10:05 AM, Robert Gründler wrote:
 Hi again,
 
 let's say you have 2 solr Instances, which have both exactly the same 
 configuration (schema, solrconfig, etc).
 
 Could it cause any troubles if we import an index from a SQL database on 
 solr instance A, and copy the whole
 index to the datadir of solr instance B (both solr instances run on 
 different servers) ?.
 
 As far as i can tell, this should work and solr instance B should have the 
 exact same index as solr instance A after the copy-process.
 
 I believe this should work, but I would take a couple of precautions.  I'd 
 stop Solr before putting the new index into place.  If you can't have it down 
 for the entirety of the copy process, then copy it into an adjacent 
 directory, shut down solr, rename the directories, and restart Solr.
 
 If the Solr that built the index (specifically, the Lucene that comes with 
 it) is newer than the one that you are copying to, it won't work.
 
 If you've checked all that and if you're still having trouble, let us know.
 
 Shawn
 



Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
Hi,

we have a serious harddisk problem, and it's definitely related to a 
full-import from a relational
database into a solr index.

The first time it happened on our development server, where the raidcontroller 
crashed during a full-import
of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2 of 
the harddisks where the solr
index files are located stopped working (we needed to replace them).

After the crash of the raid controller, we decided to move the development of 
solr/index related stuff to our
local development machines. 

Yesterday i was running another full-import of ~10 Million documents on my 
local development machine, 
and during the import, a harddisk failure occurred. Since this failure, my 
harddisk activity seems to 
be around 100% all the time, even if no solr server is running at all. 

I've been googling the last 2 days to find some info about solr related 
harddisk problems, but i didn't find anything
useful.

Are there any steps we need to take care of in respect to harddisk failures 
when doing a full-import? Right now,
our steps look like this:

1. Delete the current index
2. Restart solr, to load the updated schemas
3. Start the full import

Initially, the solr index and the relational database were located on the same 
harddisk. After the crash, we moved
the index to a separate harddisk, but nevertheless this harddisk crashed too.

I'd really appreciate any hints on what we might do wrong when importing data, 
as we can't release this
on our production servers when there's the risk of harddisk failures.


thanks.


-robert







Re: Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
 The very first thing I'd ask is how much free space is on your disk
 when this occurs? Is it possible that you're simply filling up your
 disk?

no, i've checked that already. all disks have plenty of space (they have
a capacity of 2TB, and are currently filled up to 20%).

 
 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?
 

index size is not a problem in our case. Our index currently has about 3GB.

What do you mean by "optimizing as you add items to your index"?

 But I've never heard of Solr causing hard disk crashes,

neither have we, and google seems to be of the same opinion.

One thing that i've found is the mergeFactor value:

http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor

Our sysadmin speculates that maybe the chunk size of our raid/harddisks
and the segment size of the lucene index do not play well together.

Does the lucene segment size affect how the data is written to the disk?
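For reference, mergeFactor and the related buffering settings live in solrconfig.xml; a minimal sketch with the 1.4-era example defaults, in case tuning them turns out to be relevant here:

  <indexDefaults>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <useCompoundFile>false</useCompoundFile>
  </indexDefaults>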


thanks for your help.


-robert







 
 Best
 Erick
 
 2010/12/2 Robert Gründler rob...@dubture.com
 
 Hi,
 
 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.
 
 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).
 
 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.
 
 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.
 
 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.
 
 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:
 
 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import
 
 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.
 
 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.
 
 
 thanks.
 
 
 -robert
 
 
 
 
 
 



Re: Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
On Dec 2, 2010, at 15:43 , Sven Almgren wrote:

 What Raid controller do you use, and what kernel version? (Assuming
 Linux). We hade problems during high load with a 3Ware raid controller
 and the current kernel for Ubuntu 10.04, we hade to downgrade the
 kernel...
 
 The problem was a bug in the driver that only showed up with very high
 disk load (as is the case when doing imports)
 

We're running freebsd:

RaidController  3ware 9500S-8
Corrupt unit: Raid-10 3725.27GB 256K Stripe Size without BBU
Freebsd 7.2, UFS Filesystem.



 /Sven
 
 2010/12/2 Robert Gründler rob...@dubture.com:
 The very first thing I'd ask is how much free space is on your disk
 when this occurs? Is it possible that you're simply filling up your
 disk?
 
 no, i've checked that already. all disks have plenty of space (they have
 a capacity of 2TB, and are currently filled up to 20%.
 
 
 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?
 
 
 index size is not a problem in our case. Our index currently has about 3GB.
 
 What do you mean with optimizing as you add items to your index?
 
 But I've never heard of Solr causing hard disk crashes,
 
 neither did we, and google is the same opinion.
 
 One thing that i've found is the mergeFactor value:
 
 http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor
 
 Our sysadmin speculates that maybe the chunk size of our raid/harddisks
 and the segment size of the lucene index does not play well together.
 
 Does the lucene segment size affect how the data is written to the disk?
 
 
 thanks for your help.
 
 
 -robert
 
 
 
 
 
 
 
 
 Best
 Erick
 
 2010/12/2 Robert Gründler rob...@dubture.com
 
 Hi,
 
 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.
 
 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).
 
 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.
 
 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.
 
 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.
 
 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:
 
 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import
 
 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.
 
 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.
 
 
 thanks.
 
 
 -robert
 
 
 
 
 
 
 
 



Is this sort order possible in a single query?

2010-11-24 Thread Robert Gründler
Hi,

we have a requirement for one of our search results which has a quite complex 
sorting strategy. Let me explain the document first, using an example:

The document is a book. It has several indexed text fields: Title, Author, 
Distributor. It has two integer columns, where one reflects the number of sold 
copies (num_copies), and the other reflects
the number of comments on the website (num_comments).

The Requirement for the relevancy looks like this:

* Documents which have exact matches in the Author field should be ranked highest, disregarding their values in the num_copies and num_comments fields
* After the exact matches, the sorting should be based on the value in the field num_copies, but only for documents where this field is set
* After the num_copies matches, the sorting should be based on num_comments

I'm wondering if this kind of sort order can be implemented in a single query, or if i need to break it down into several queries and merge the results on application level.

-robert




Re: Is this sort order possible in a single query?

2010-11-24 Thread Robert Gründler
thanks a lot for the explanation. i'm a little confused about solr 1.5, 
especially
after finding this wiki page:

http://wiki.apache.org/solr/Solr1.5

Is there a stable build available for version 1.5, so i can test your suggestion
using functionquery?


-robert



On Nov 24, 2010, at 1:53 PM, Geert-Jan Brits wrote:

 You could do it with sorting on a functionquery (which is supported from solr 1.5):
 http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
 Consider the search:
 http://localhost:8093/solr/select?author:'j.k.rowling'
 
 sorting like you specified would involve:
 
 1. introducing an extra field: 'author_exact' of type 'string' which takes care of the exact matching. (You can populate it by defining it as a copyfield of Author so your indexing-code doesn't change)
 
 2. set sortMissingLast=true for 'num_copies' and 'num_comments', i.e. something like <fieldType name="num_copies" sortMissingLast="true" ...> on the type those fields use (see the sketch after this list).
 
 this makes sure that documents which don't have the value set end up at the end of the sort when sorted on that particular field.
 
 3. construct a functionquery that scores either 0 (no match) or x (not sure what x is (1?), but it should always be the same for all exact matches)
 
 This gives
 
 http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc
 
 which scores all exact matches before all partial matches.
 
 4. now just concatenate the other sorts, giving:
 
 http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc, num_copies desc, num_comments desc
 
 That should do it.
 
 Please note that 'num_copies' and 'num_comments' still kick in to break the
 tie for documents that exactly match on 'author_exact'. I assume this is
 ok.
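 To make step 2 above concrete, a minimal schema.xml sketch (type and field names are placeholders; in this Solr generation sortMissingLast is typically declared on the fieldType, and sorting needs a sortable numeric type):
 
   <fieldType name="sint_ml" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
   <field name="num_copies"   type="sint_ml" indexed="true" stored="true"/>
   <field name="num_comments" type="sint_ml" indexed="true" stored="true"/>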
 
 I can't see a way to do it without functionqueries at the moment, which
 doesn't mean there isn't any.
 
 Hope that helps,
 
 Geert-Jan
 
 
 
 
 
 
 
 query({!dismax qf=text v='solr rocks'})
 
 
 
 
 2010/11/24 Robert Gründler rob...@dubture.com
 
 Hi,
 
 we have a requirement for one of our search results which has a quite
 complex sorting strategy. Let me explain the document first, using an
 example:
 
 The document is a book. It has several indexed text fields: Title, Author,
 Distributor. It has two integer columns, where one reflects the number of
 sold copies (num_copies), and the other reflects
 the number of comments on the website (num_comments).
 
 The Requirement for the relevancy looks like this:
 
 * Documents which have exact matches in the Author field, should be
 ranked highest, disregarding their values in num_copies and num_comments
 fields
 * After the exact matches, the sorting should be based on the value in the
 field num_copies, but only for documents, where this field is set
 * After the num_copies matches, the sorting should be based on
 num_comments
 
 I'm wondering is this kind of sort order can be implemented in a single
 query, or if i need to break it down into several queries and merge the
 results on application level.
 
 -robert
 
 
 



Respect token order in matches

2010-11-18 Thread Robert Gründler
Hi,

is there a way to make solr respect the order of token matches when the query 
is a multi-term string?

Here's an example:

Query String: John C

Indexed Strings:

- John Cage
- Cargill John

This will return both indexed strings as a result. However, "Cargill John" should not match in that case, because the order of the tokens is not the same as in the query.

Here's the fieldtype:

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">

    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>

  </fieldType>

Is there a way to achieve this using this fieldtype?


thanks!






LockReleaseFailedException

2010-11-18 Thread Robert Gründler
Hi,

i'm suddenly getting a LockReleaseFailedException when starting a full-import 
using the Dataimporthandler:

org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a 
NativeFSLock which is held by another indexer component


This worked without problems until just now. Is there some lock file i can 
remove to unlock the index again?


thanks.

-robert





Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler
thanks for the explanation.

the results for the autocompletion are pretty good now, but we still have a small problem.

When there are hits in the edgytext2 field, results which only have hits in the edgytext field should not be returned at all.

Example:

Query: Martin Sco

Current Results (in that order):

- Martin Scorsese
- Martin Lawrence
- Joseph Martin

However, in an autocompletion context, only "Martin Scorsese" makes sense; the 2 others are logically not correct.

I'm not sure if this can be solved on the solr side, or if we should implement 
the logic in the
application.


thanks!

-robert







On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:

 Without the parens, the edgytext: only applied to "Mr", the default field still applied to "Scorcese".
 
 The double quotes are necessary in the second case (rather than parens) because, on a non-tokenized field, the standard query parser will pre-tokenize on whitespace before sending the individual white-space separated words to match against the index. If the index includes multi-word tokens with internal whitespace, they will never match. With double quotes the query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact.
 
 Robert Gründler wrote:
 Did you run your query without using () and "" operators? If yes can you try this?
 q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

 
 I didn't use () and "" in my query before. Using the query with those operators works now, stopwords are thrown out as they should, thanks.
 
 However, i don't understand how the () and "" operators affect the StopWordFilter.
 
 Could you give a brief explanation for the above example?
 
 thanks!
 
 
 -robert
 
 
 
 
 
  



Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler
it seems adding the '+' (required) operator to each term in a multi-term query 
does the trick:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+

ie: edgytext2:(+Martin +Sco)


-robert



On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote:

 thanks for the explanation.
 
 the results for the autocompletion are pretty good now, but we still have a 
 small problem. 
 
 When there are hits in the edgytext2 fields, results which only have hits 
 in the edgytext field
 should not be returned at all.
 
 Example:
 
 Query: Martin Sco
 
 Current Results (in that order):
 
 - Martin Scorsese
 - Martin Lawrence
 - Joseph Martin
 
 However, in an autocompletion context, only Martin Scorsese makes sense, 
 the 2 others are logically
 not correct.
 
 I'm not sure if this can be solved on the solr side, or if we should 
 implement the logic in the
 application.
 
 
 thanks!
 
 -robert
 
 
 
 
 
 
 
 On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:
 
 Without the parens, the edgytext: only applied to Mr, the default field 
 still applied to Scorcese.
 
 The double quotes are neccesary in the second case (rather than parens), 
 because on a non-tokenized field because the standard query parser will 
 pre-tokenize on whitespace before sending individual white-space seperated 
 words to match the index. If the index includes multi-word tokens with 
 internal whitespace, they will never match. But the standard query parser 
 doesn't pre-tokenize like this, it passes the whole phrase to the index 
 intact.
 
 Robert Gründler wrote:
 Did you run your query without using () and  operators? If yes can you 
 try this?
 q=edgytext:(Mr Scorsese) OR edgytext2:Mr Scorsese^2.0
 
 
 I didn't use () and  in my query before. Using the query with those 
 operators
 works now, stopwords are thrown out as the should, thanks.
 
 However, i don't understand how the () and  operators affect the 
 StopWordFilter.
 
 Could you give a brief explanation for the above example?
 
 thanks!
 
 
 -robert
 
 
 
 
 
 
 



EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
Hi,

consider the following fieldtype (used for autocompletion):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>


This works fine as long as the query string is a single word. For multiple 
words, the ranking is weird though.

Example:

Query String: Bill Cl

Result (in that order):

- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

"Bill Clinton" should have the highest rank in that case.

Has anyone an idea how to configure this fieldtype to make matches in both tokens rank higher than those that match in either token?


thanks!


-robert





Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
thanks a lot, that setup works pretty well now.

the only problem now is that the stopwords do not work that well anymore. I'll provide an example, but first the 2 fieldtypes:

  <!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>

  <!-- autocomplete field which finds startsWith matches only ("scor" matches only "Scorpio", but not "Martin Scorsese") -->

  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>


This setup now causes trouble with stopwords, though. Here's an example:

Let's say the index contains 2 strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0.

Any idea why in this case "Martin Scorsese" is not in the result at all?


thanks again!


-robert






On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:

 You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator.
 
 edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
 
 You can even apply a boost so that begins-with matches come first.
 
 --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote:
 
 From: Robert Gründler rob...@dubture.com
 Subject: EdgeNGram relevancy
 To: solr-user@lucene.apache.org
 Date: Thursday, November 11, 2010, 5:51 PM
 Hi,
 
 consider the following fieldtype (used for
 autocompletion):
 
   fieldType name=edgytext class=solr.TextField
 positionIncrementGap=100
analyzer type=index
  tokenizer
 class=solr.WhitespaceTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter
 class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true
 / 
  filter
 class=solr.PatternReplaceFilterFactory pattern=([^a-z])
 replacement= replace=all /
  filter
 class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=25 /
/analyzer
analyzer type=query
  tokenizer
 class=solr.WhitespaceTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter
 class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true /
  filter
 class=solr.PatternReplaceFilterFactory pattern=([^a-z])
 replacement= replace=all /
/analyzer
   /fieldType
 
 
 This works fine as long as the query string is a single
 word. For multiple words, the ranking is weird though.
 
 Example:
 
 Query String: Bill Cl
 
 Result (in that order):
 
 - Clyde Phillips
 - Clay Rogers
 - Roger Cloud
 - Bill Clinton
 
 Bill Clinton should have the highest rank in that
 case.  
 
 Has anyone an idea how to to configure this fieldtype to
 make matches in both tokens rank higher than those who match
 in either token?
 
 
 thanks!
 
 
 -robert
 
 
 
 
 
 
 



Re: Concatenate multiple tokens into one

2010-11-11 Thread Robert Gründler
I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but i realized that what i wanted to achieve is more easily implemented in another way (by using 2 separate field types).

Have a look at a previous mail i wrote to the list and the reply from Ahmet Arslan (topic: EdgeNGram relevancy).


best


-robert




On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:

 Hi Robert, All,
 
 I have a similar problem, here is my fieldType: http://paste.pocoo.org/show/289910/
 
 I want to include stopword removal and lowercase the incoming terms. The idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory.
 
 If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful.
 
 Many thanks
 
 Nick
 
 On 11 Nov 2010, at 00:23, Robert Gründler wrote:
 
 
 On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
 
 Are you sure you really want to throw out stopwords for your use case?  I 
 don't think autocompletion will work how you want if you do. 
 
 in our case i think it makes sense. the content is targetting the electronic 
 music / dj scene, so we have a lot of words like DJ or featuring which
 make sense to throw out of the query. Also searches for the beastie boys 
 and beastie boys should return a match in the autocompletion.
 
 
 And if you don't... then why use the WhitespaceTokenizer and then try to 
 jam the tokens back together? Why not just NOT tokenize in the first place. 
 Use the KeywordTokenizer, which really should be called the 
 NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just creates 
 one token from the entire input string. 
 
 I started out with the KeywordTokenizer, which worked well, except the 
 StopWord problem.
 
 For now, i've come up with a quick-and-dirty custom ConcatFilter, which 
 does what i'm after:
 
 public class ConcatFilter extends TokenFilter {
 
  private TokenStream tstream;
 
  protected ConcatFilter(TokenStream input) {
  super(input);
  this.tstream = input;
  }
 
  @Override
  public Token next() throws IOException {
  
  Token token = new Token();
  StringBuilder builder = new StringBuilder();
  
  TermAttribute termAttribute = (TermAttribute) 
 tstream.getAttribute(TermAttribute.class);
  TypeAttribute typeAttribute = (TypeAttribute) 
 tstream.getAttribute(TypeAttribute.class);
  
  boolean incremented = false;
  
  while (tstream.incrementToken()) {
  
  if (typeAttribute.type().equals("word")) {
  builder.append(termAttribute.term());   
 
  }
  incremented = true;
  }
  
  token.setTermBuffer(builder.toString());
  
  if (incremented == true)
  return token;
  
  return null;
  }
 }
 
 I'm not sure if this is a safe way to do this, as i'm not familar with the 
 whole solr/lucene implementation after all.
 
 
 best
 
 
 -robert
 
 
 
 
 
 Then lowercase, remove whitespace (or not), do whatever else you want to do 
 to your single token to normalize it, and then edgengram it. 
 
 If you include whitespace in the token, then when making your queries for 
 auto-complete, be sure to use a query parser that doesn't do 
 pre-tokenization, the 'field' query parser should work well for this. 
 
 Jonathan
 
 
 
 
 From: Robert Gründler [rob...@dubture.com]
 Sent: Wednesday, November 10, 2010 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Concatenate multiple tokens into one
 
 Hi,
 
 i've created the following filterchain in a field type, the idea is to use 
 it for autocompletion purposes:
 
 tokenizer class=solr.WhitespaceTokenizerFactory/ !-- create tokens 
 separated by whitespace --
 filter class=solr.LowerCaseFilterFactory/ !-- lowercase everything --
 filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords.txt enablePositionIncrements=true /  !-- throw out 
 stopwords --
 filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) 
 replacement= replace=all /  !-- throw out all everything except a-z 
 --
 
 !-- actually, here i would like to join multiple tokens together again, to 
 provide one token for the EdgeNGramFilterFactory --
 
 filter class=solr.EdgeNGramFilterFactory minGramSize=1 
 maxGramSize=25 / !-- create edgeNGram tokens for autocomplete matches 
 --
 
 With that kind of filterchain, the EdgeNGramFilterFactory will receive 
 multiple tokens on input strings with whitespaces in it. This leads to the 
 following results:
 Input Query: George Cloo
 Matches:
 - George Harrison
 - John Clooridge
 - George Smith
 -George Clooney
 - etc

Re: Concatenate multiple tokens into one

2010-11-11 Thread Robert Gründler
this is the full source code, but be warned, i'm not a java developer, and i have no background in lucene/solr development:

// ConcatFilter

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ConcatFilter extends TokenFilter {

  protected ConcatFilter(TokenStream input)
  {
    super(input);
  }

  @Override
  public Token next() throws IOException
  {
    Token token = new Token();
    StringBuilder builder = new StringBuilder();

    TermAttribute termAttribute = (TermAttribute) input.getAttribute(TermAttribute.class);
    TypeAttribute typeAttribute = (TypeAttribute) input.getAttribute(TypeAttribute.class);

    boolean hasToken = false;

    while (input.incrementToken())
    {
      if (typeAttribute.type().equals("word")) {
        builder.append(termAttribute.term());
        hasToken = true;
      }
    }

    if (hasToken == true) {
      token.setTermBuffer(builder.toString());
      return token;
    }

    return null;
  }
}

//ConcatFilterFactory:

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {

@Override
public TokenStream create(TokenStream stream) {
return new ConcatFilter(stream);
}
}

and in your schema.xml, you can simply add the filterfactory using this element:

<filter class="com.example.ConcatFilterFactory" />
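For illustration, a hypothetical fieldType wiring the factory between the stopword filter and the EdgeNGramFilterFactory, matching the chain discussed earlier in this thread (the field type name is a placeholder, and the query analyzer shown is just one plausible choice):

  <fieldType name="edgytext_concat" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="com.example.ConcatFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>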

Jar files i have included in the buildpath (can be found in the solr download package):

apache-solr-core-1.4.1.jar
lucene-analyzers-2.9.3.jar
lucene-core-2.9.3.jar


good luck ;)


-robert




On Nov 11, 2010, at 8:45 PM, Nick Martin wrote:

 Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm 
 not sure what i need in the classpath and where Token comes from.
 Will check the thread you mention.
 
 Best
 
 Nick
 
 On 11 Nov 2010, at 18:13, Robert Gründler wrote:
 
 I've posted a ConcaFilter in my previous mail which does concatenate tokens. 
 This works fine, but i
 realized that what i wanted to achieve is implemented easier in another way 
 (by using 2 separate field types).
 
 Have a look at a previous mail i wrote to the list and the reply from Ahmet 
 Arslan (topic: EdgeNGram relevancy).
 
 
 best
 
 
 -robert
 
 
 
 
 On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
 
 Hi Robert, All,
 
 I have a similar problem, here is my fieldType, 
 http://paste.pocoo.org/show/289910/
 I want to include stopword removal and lowercase the incoming terms. The 
 idea being to take, Foo Bar Baz Ltd and turn it into foobarbaz for the 
 EdgeNgram filter factory.
 If anyone can tell me a simple way to concatenate tokens into one token 
 again, similar too the KeyWordTokenizer that would be super helpful.
 
 Many thanks
 
 Nick
 
 On 11 Nov 2010, at 00:23, Robert Gründler wrote:
 
 
 On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
 
 Are you sure you really want to throw out stopwords for your use case?  I 
 don't think autocompletion will work how you want if you do. 
 
 in our case i think it makes sense. the content is targetting the 
 electronic music / dj scene, so we have a lot of words like DJ or 
 featuring which
 make sense to throw out of the query. Also searches for the beastie boys 
 and beastie boys should return a match in the autocompletion.
 
 
 And if you don't... then why use the WhitespaceTokenizer and then try to 
 jam the tokens back together? Why not just NOT tokenize in the first 
 place. Use the KeywordTokenizer, which really should be called the 
 NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just 
 creates one token from the entire input string. 
 
 I started out with the KeywordTokenizer, which worked well, except the 
 StopWord problem.
 
 For now, i've come up with a quick-and-dirty custom ConcatFilter, which 
 does what i'm after:
 
 public class ConcatFilter extends TokenFilter {
 
private TokenStream tstream;
 
protected ConcatFilter(TokenStream input) {
super(input);
this.tstream = input;
}
 
@Override
public Token next() throws IOException {

Token token = new Token();
StringBuilder builder = new StringBuilder();

TermAttribute termAttribute = (TermAttribute) 
 tstream.getAttribute(TermAttribute.class);
TypeAttribute typeAttribute = (TypeAttribute) 
 tstream.getAttribute(TypeAttribute.class);

boolean incremented = false;

while (tstream.incrementToken()) {

if (typeAttribute.type().equals(word

Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
according to the fieldtype i posted previously, i think it's because of:

1. The WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips".
2. The EdgeNGramFilter gets the 2 tokens and creates EdgeNGrams for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer.

This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl".
-robert




On Nov 11, 2010, at 21:34 , Andy wrote:

 Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?
 
 "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?
 
 Thanks.
 
 --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote:
 
 You can add an additional field, with
 using KeywordTokenizerFactory instead of
 WhitespaceTokenizerFactory. And query both these fields with
 an OR operator. 
 
 edgytext:(Bill Cl) OR edgytext2:Bill Cl
 
 You can even apply boost so that begins with matches comes
 first.
 
 --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com
 wrote:
 
 From: Robert Gründler rob...@dubture.com
 Subject: EdgeNGram relevancy
 To: solr-user@lucene.apache.org
 Date: Thursday, November 11, 2010, 5:51 PM
 Hi,
 
 consider the following fieldtype (used for
 autocompletion):
 
   fieldType name=edgytext
 class=solr.TextField
 positionIncrementGap=100
analyzer type=index
  tokenizer
 class=solr.WhitespaceTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter
 class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true
 / 
  filter
 class=solr.PatternReplaceFilterFactory
 pattern=([^a-z])
 replacement= replace=all /
  filter
 class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=25 /
/analyzer
analyzer type=query
  tokenizer
 class=solr.WhitespaceTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter
 class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true
 /
  filter
 class=solr.PatternReplaceFilterFactory
 pattern=([^a-z])
 replacement= replace=all /
/analyzer
   /fieldType
 
 
 This works fine as long as the query string is a
 single
 word. For multiple words, the ranking is weird
 though.
 
 Example:
 
 Query String: Bill Cl
 
 Result (in that order):
 
 - Clyde Phillips
 - Clay Rogers
 - Roger Cloud
 - Bill Clinton
 
 Bill Clinton should have the highest rank in that
 case.  
 
 Has anyone an idea how to to configure this fieldtype
 to
 make matches in both tokens rank higher than those who
 match
 in either token?
 
 
 thanks!
 
 
 -robert
 
 
 
 
 
 
 
 
 
 
 



Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
 
 Did you run your query without using () and "" operators? If yes can you try this?
 q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators works now, stopwords are thrown out as they should, thanks.

However, i don't understand how the () and "" operators affect the StopWordFilter.

Could you give a brief explanation for the above example?

thanks!


-robert






Best practices to rebuild index on live system

2010-11-11 Thread Robert Gründler
Hi again,

we're coming closer to the rollout of our newly created solr/lucene based 
search, and i'm wondering
how people handle changes to their schema on live systems. 

In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 
hours for a full dataimport from the relational
database. The Index is being updated in realtime, through post 
insert/update/delete events in our ORM.

So far, i can only think of 2 scenarios for rebuilding the index, if we need to 
update the schema after the rollout:

1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After 
importing, switch the application to cores A1, B1, C1

This will most likely result in an inconsistent index, as during the 1.5 hours of indexing, the database might get inserts/updates/deletes.

2. Put the live system into a read-only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback that users are not able to write to the app.

Does Solr provide any built-in approaches to this problem?
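For what it's worth, the core switch in scenario 1 can be done without touching the application config via the CoreAdmin SWAP command, which atomically exchanges the names of two cores defined in solr.xml (host and port below are placeholders):

  http://localhost:8983/solr/admin/cores?action=SWAP&core=A&other=A1

The application keeps querying core A while A1 is being rebuilt, and after the swap the old index stays available under the name A1 as a fallback. This does not by itself address the writes that happen during the 1.5-hour import, though.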


best

-robert





Concatenate multiple tokens into one

2010-11-10 Thread Robert Gründler
Hi,

i've created the following filterchain in a field type, the idea is to use it 
for autocompletion purposes:

<tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
<filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->

<!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->

<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filterchain, the EdgeNGramFilterFactory will receive multiple tokens for input strings with whitespace in them. This leads to the following results:

Input Query: George Cloo

Matches:
- George Harrison
- John Clooridge
- George Smith
- George Clooney
- etc

However, only "George Clooney" should match in the autocompletion use case.
Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which 
concatenates all the tokens generated by the WhitespaceTokenizerFactory.
Are there filters which can do such a thing?

If not, are there examples how to implement a custom TokenFilter?

thanks!

-robert


 



Re: Concatenate multiple tokens into one

2010-11-10 Thread Robert Gründler

On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:

 Are you sure you really want to throw out stopwords for your use case?  I 
 don't think autocompletion will work how you want if you do. 

in our case i think it makes sense. the content is targeting the electronic music / dj scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion.

 
 And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place. Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string.

I started out with the KeywordTokenizer, which worked well, except for the StopWord problem.

For now, i've come up with a quick-and-dirty custom ConcatFilter, which does what i'm after:

public class ConcatFilter extends TokenFilter {

  private TokenStream tstream;

  protected ConcatFilter(TokenStream input) {
    super(input);
    this.tstream = input;
  }

  @Override
  public Token next() throws IOException {

    Token token = new Token();
    StringBuilder builder = new StringBuilder();

    TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
    TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);

    boolean incremented = false;

    while (tstream.incrementToken()) {

      if (typeAttribute.type().equals("word")) {
        builder.append(termAttribute.term());
      }
      incremented = true;
    }

    token.setTermBuffer(builder.toString());

    if (incremented == true)
      return token;

    return null;
  }
}

I'm not sure if this is a safe way to do this, as i'm not familiar with the whole solr/lucene implementation after all.


best


-robert




 
 Then lowercase, remove whitespace (or not), do whatever else you want to do 
 to your single token to normalize it, and then edgengram it. 
 
 If you include whitespace in the token, then when making your queries for 
 auto-complete, be sure to use a query parser that doesn't do 
 pre-tokenization, the 'field' query parser should work well for this. 
 
 Jonathan
 
 
 
 
 From: Robert Gründler [rob...@dubture.com]
 Sent: Wednesday, November 10, 2010 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Concatenate multiple tokens into one
 
 Hi,
 
 i've created the following filterchain in a field type, the idea is to use it 
 for autocompletion purposes:
 
 tokenizer class=solr.WhitespaceTokenizerFactory/ !-- create tokens 
 separated by whitespace --
 filter class=solr.LowerCaseFilterFactory/ !-- lowercase everything --
 filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords.txt enablePositionIncrements=true /  !-- throw out 
 stopwords --
 filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) 
 replacement= replace=all /  !-- throw out all everything except a-z --
 
 !-- actually, here i would like to join multiple tokens together again, to 
 provide one token for the EdgeNGramFilterFactory --
 
 filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=25 
 / !-- create edgeNGram tokens for autocomplete matches --
 
 With that kind of filterchain, the EdgeNGramFilterFactory will receive 
 multiple tokens on input strings with whitespaces in it. This leads to the 
 following results:
 Input Query: George Cloo
 Matches:
 - George Harrison
 - John Clooridge
 - George Smith
 -George Clooney
 - etc
 
 However, only George Clooney should match in the autocompletion use case.
 Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which 
 concatenates all the tokens generated by the WhitespaceTokenizerFactory.
 Are there filters which can do such a thing?
 
 If not, are there examples how to implement a custom TokenFilter?
 
 thanks!
 
 -robert
 
 
 
 



Dataimporthandler crashed raidcontroller

2010-11-04 Thread Robert Gründler
Hi all,

we had a severe problem with our raidcontroller on one of our servers today while importing a table with ~8 million rows into a solr index. After importing about 4 million documents, our server shut down and failed to restart due to a corrupt raid disk.

The Solr data import was the only heavy process running on that machine during
the crash.

Has anyone experienced hdd/raid-related problems during indexing large sql 
databases into solr?


thanks!


-robert