Tokenizing and searching named character entity references

2008-07-24 Thread F Knudson

Greetings:

I am working with many different data sources - some sources employ entity
references; others do not.  My goal is to make searching across sources as
consistent as possible.

Example text - 

Source1:   weakening H&delta; absorption
Source1:   zero-field gap &omega;

Source2:  weakening H delta absorption
Source2:  zero-field gap omega

Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1, the
named character entity reference is replaced with the corresponding character.

This works great.  

But I want the search tokens to be identical for each source.  I need to
capture &delta; as a token.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
 
Is this possible with the Solr-supplied tokenizers and filters?  I experimented
with different combinations and orderings and was not successful.

Is this possible using synonyms?  I also experimented with this route but
again was not successful.

Do I need to create a custom tokenizer?
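
If a custom filter turns out to be the answer, I imagine something along these
lines (an untested sketch, assuming the older TokenStream API (Token next())
used by the Lucene version bundled with Solr 1.2/1.3; the class name and the
mappings are made up):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Hypothetical filter: rewrites decoded Greek characters to their spelled-out
 * names so that Source1 (&delta;) and Source2 ("delta") index the same token.
 */
public class GreekLetterNameFilter extends TokenFilter {
  private static final Map<String, String> NAMES = new HashMap<String, String>();
  static {
    NAMES.put("\u03b4", "delta");   // the character &delta; decodes to
    NAMES.put("\u03c9", "omega");   // the character &omega; decodes to
    // ... remaining letters as needed
  }

  public GreekLetterNameFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t == null) {
      return null;
    }
    String name = NAMES.get(t.termText());
    if (name != null) {
      t.setTermText(name);          // replace the one-character token
    }
    return t;
  }
}

It would still need a small TokenFilterFactory wrapper to be usable from
schema.xml, and it only handles a Greek character that arrives as a token by
itself; an entity glued to other letters (H&delta; decoding to the single
token Hδ) would have to be split off earlier in the chain.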

Thanks
Frances



RE: Optimization taking days/weeks

2008-02-29 Thread F Knudson

We will review the Java settings.  The current settings are a bit low - but
the indexer typically does not use even 50% of the allocated 1024MB max heap.
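
For example, on a 64-bit JVM the heap could be raised along these lines
(values are illustrative, not tuned for this machine; everything after the
heap flags is whatever we already launch Solr with):

java -server -Xms4096M -Xmx8192M [existing Solr launch arguments]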

Yes, the index is large - only 3 fields are stored - and I have set the
positionIncrementGap to 50 (down from 100) in an attempt to reduce index
size.  Would you suggest building one index used only for searching and a
second index used only for display?  Does that fit within your definition of
partition?
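
For reference, whether a field's value is kept for retrieval/display is set
per field in schema.xml, so it could also be trimmed field by field before
splitting the index; a hedged example (field names are made up):

<!-- searched but never displayed: indexed only -->
<field name="body"  type="text" indexed="true" stored="false"/>
<!-- one of the few fields needed for display -->
<field name="title" type="text" indexed="true" stored="true"/>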

Thanks
Frances


Alex Benjamen wrote:
 
 This sounds too familiar... 
 
java settings used - java -Xmx1024M -Xms1024M  
 Sounds like your settings are pretty low... if you're using 64bit JVM, you
 should be able to set 
 these much higher, maybe give it like 8gb. 
 
 Another thing, you may want to look at reducing the index size... is there
 any way you could 
 partition the index? Also only index fields which you need and do not
 store the values in the index.
 I originally had an index which was 50Gb in size, and after removing
 fields I do not need, I'm down
 to 8Gb without storing any values in the index.
  
  
 
 




Re: Optimization taking days/weeks

2008-02-29 Thread F Knudson

We are a bit concerned regarding the index size.  At least no response (so
far) has indicated that the size is unmanageable.  We killed the process -
will move to Java6 - and will use vmstat to monitor the new optimization
process.
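
A minimal sketch of the monitoring we have in mind (assuming a Sun JDK 5 or
later for jstat; the 2-second interval is arbitrary):

vmstat 2                       # CPU idle/wait, swap and IO while the optimize runs
jstat -gcutil <solr-pid> 2s    # per-generation heap occupancy and total GC time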
At what index size would you begin to worry?  Or is it a combination of
index size, optimization time, and response time?
We are data rich here!
Thanks
Frances


Otis Gospodnetic wrote:
 
 That's a tiny little index there ;)  Circa 100GB?
  
 What do you see if you run vmstat 2 while the optimization is happening?
 Non-idle CPU?  A pile of IO?  Is there a reason for such a small heap on a
 machine with 32GB of RAM?
 
 Otis
 
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
 From: F Knudson [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Thursday, February 28, 2008 9:54:50 AM
 Subject: Optimization taking days/weeks
 
 
 Optimization time on solr index has turned into days/weeks.
 We are using solr 1.2.
 We use one box to build/optimize indexes. This index is copied to another
 box for searching purposes.
 We welcome suggestions/comments, etc.  We are a bit stumped on this.
 Details are below.
 
 Box details
 Proc: 8 Dual Core 2.6GHz
 Mem: 32 GB
 OS: Red Hat Linux Enterprise 4
 Kernel: 2.6.9-55.0.12.ELlargesmp
 
 These are details from the index currently in use.  Search response time
 is
 very acceptable (searchers are very happy)
 Optimization time - 10433  (12/11/07)
 index size - 229486464
 # of records - 84960570 
 index directory
 flasher# ls -l
 total 229486464
 -rw-r--r--   1 flknud   staff22197926593 Dec 12 08:07 _2bl6.fdt
 -rw-r--r--   1 flknud   staff679684560 Dec 12 08:20 _2bl6.fdx
 -rw-r--r--   1 flknud   staff208 Dec 12 08:23 _2bl6.fnm
 -rw-r--r--   1 flknud   staff40176405625 Dec 12 09:28 _2bl6.frq
 -rw-r--r--   1 flknud   staff594723994 Dec 12 09:41 _2bl6.nrm
 -rw-r--r--   1 flknud   staff47616340310 Dec 12 12:07 _2bl6.prx
 -rw-r--r--   1 flknud   staff76708079 Dec 12 12:25 _2bl6.tii
 -rw-r--r--   1 flknud   staff6154384415 Dec 12 12:42 _2bl6.tis
 -rw-r--r--   1 flknud   staff 20 Dec 12 12:48 segments.gen
 -rw-r--r--   1 flknud   staff 44 Dec 12 12:48 segments_2c64
 --
 
 current directory listing
 indexed new records - Jan 22 and Jan 27
 # of records - 85032470
 optimization time - 558188
 
 There were no out of memory errors.  There was 800961792KB  left in the
 directory.  The files were not 
 collapsed as expected.  There are still files dated Jan 22 and Jan 27.  
 
 A new optimization was started Feb. 11 and continues.
 This is a snapshot of the index directory.
 
 We have at least another million records to add.  Plus weekly updates of
 approximately 103K records.
 We are using the direct indexing method.
 java settings used - java -Xmx1024M -Xms1024M 
 
 The files continue to grow so work is progressing.
 snapshot 2/21/08
 -bash-3.00$ ls -ltr
 total 205396680
 -rw-r--r--  1 flknud users 208 Jan 10 07:15 _2bm7.fnm
 -rw-r--r--  1 flknud users 22202159522 Jan 10 08:09 _2bm7.fdt
 -rw-r--r--  1 flknud users   679819760 Jan 10 08:09 _2bm7.fdx
 -rw-r--r--  1 flknud users 40184944027 Jan 16 18:16 _2bm7.frq
 -rw-r--r--  1 flknud users 47626230575 Jan 16 18:16 _2bm7.prx
 -rw-r--r--  1 flknud users  6155230704 Jan 16 18:16 _2bm7.tis
 -rw-r--r--  1 flknud users76704158 Jan 16 18:16 _2bm7.tii
 -rw-r--r--  1 flknud users   594842294 Jan 16 18:18 _2bm7.nrm
 -rw-r--r--  1 flknud users 208 Jan 22 08:57 _2bpa.fnm
 -rw-r--r--  1 flknud users10806426 Jan 22 08:57 _2bpa.fdt
 -rw-r--r--  1 flknud users  371200 Jan 22 08:57 _2bpa.fdx
 -rw-r--r--  1 flknud users21114330 Jan 22 08:57 _2bpa.frq
 -rw-r--r--  1 flknud users25683573 Jan 22 08:57 _2bpa.prx
 -rw-r--r--  1 flknud users 9225592 Jan 22 08:57 _2bpa.tis
 -rw-r--r--  1 flknud users  118660 Jan 22 08:57 _2bpa.tii
 -rw-r--r--  1 flknud users  324804 Jan 22 08:57 _2bpa.nrm
 -rw-r--r--  1 flknud users 198 Jan 22 09:00 _2bpl.fnm
 -rw-r--r--  1 flknud users 1335931 Jan 22 09:00 _2bpl.fdt
 -rw-r--r--  1 flknud users   36800 Jan 22 09:00 _2bpl.fdx
 -rw-r--r--  1 flknud users 2646708 Jan 22 09:00 _2bpl.frq
 -rw-r--r--  1 flknud users 3781824 Jan 22 09:00 _2bpl.prx
 -rw-r--r--  1 flknud users 1429176 Jan 22 09:00 _2bpl.tis
 -rw-r--r--  1 flknud users   18582 Jan 22 09:00 _2bpl.tii
 -rw-r--r--  1 flknud users   32204 Jan 22 09:00 _2bpl.nrm
 -rw-r--r--  1 flknud users 198 Jan 22 09:01 _2bpm.fnm
 -rw-r--r--  1 flknud users  121716 Jan 22 09:01 _2bpm.fdt
 -rw-r--r--  1 flknud users3200 Jan 22 09:01 _2bpm.fdx
 -rw-r--r--  1 flknud users  205961 Jan 22 09:01 _2bpm.frq
 -rw-r--r--  1 flknud users  302114 Jan 22 09:01 _2bpm.prx
 -rw-r--r--  1 flknud users  233641 Jan 22 09:01 _2bpm.tis
 -rw-r--r--  1 flknud users3036 Jan 22 09:01 _2bpm.tii
 -rw-r--r--  1 flknud users2804 Jan 22 09:01

Re: Optimization taking days/weeks

2008-02-29 Thread F Knudson

Yes indeed - it was spending all of its time in garbage collection.  We will
be moving to Java6.
Thanks for your suggestion.
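
For the record, this is roughly how we plan to confirm it on the new JVM
(flags assumed for a Sun JDK of this era; everything after the heap settings
is whatever we already launch Solr with):

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Dcom.sun.management.jmxremote \
     -Xms1024M -Xmx1024M [existing Solr launch arguments]

With Java6 the jmxremote property should not even be needed for a local
jconsole attach, as noted below.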

Frances


Yonik Seeley wrote:
 
 Have you checked if this is due to running out of heap memory?
 When that happens, the garbage collector can start taking a lot of CPU.
 If you are using a Java6 JVM, it should have management enabled by
 default and you should be able to connect to it via jconsole and
 check.
 
 -Yonik
 
 On Thu, Feb 28, 2008 at 9:54 AM, F Knudson [EMAIL PROTECTED] wrote:

  Optimization time on solr index has turned into days/weeks.
  We are using solr 1.2.
  We use one box to build/optimize indexes. This index is copied to
 another
  box for searching purposes.
  We welcome suggestions/comments, etc.  We are a bit stumped on this.
  Details are below.

  Box details
  Proc: 8 Dual Core 2.6GHz
  Mem: 32 GB
  OS: Red Hat Linux Enterprise 4
  Kernel: 2.6.9-55.0.12.ELlargesmp

  These are details from the index currently in use.  Search response time
 is
  very acceptable (searchers are very happy)
  Optimization time - 10433  (12/11/07)
  index size - 229486464
  # of records - 84960570
  index directory
  flasher# ls -l
  total 229486464
  -rw-r--r--   1 flknud   staff22197926593 Dec 12 08:07 _2bl6.fdt
  -rw-r--r--   1 flknud   staff679684560 Dec 12 08:20 _2bl6.fdx
  -rw-r--r--   1 flknud   staff208 Dec 12 08:23 _2bl6.fnm
  -rw-r--r--   1 flknud   staff40176405625 Dec 12 09:28 _2bl6.frq
  -rw-r--r--   1 flknud   staff594723994 Dec 12 09:41 _2bl6.nrm
  -rw-r--r--   1 flknud   staff47616340310 Dec 12 12:07 _2bl6.prx
  -rw-r--r--   1 flknud   staff76708079 Dec 12 12:25 _2bl6.tii
  -rw-r--r--   1 flknud   staff6154384415 Dec 12 12:42 _2bl6.tis
  -rw-r--r--   1 flknud   staff 20 Dec 12 12:48 segments.gen
  -rw-r--r--   1 flknud   staff 44 Dec 12 12:48 segments_2c64
  --

  current directory listing
  indexed new records - Jan 22 and Jan 27
  # of records - 85032470
  optimization time - 558188

  There were no out of memory errors.  There was 800961792KB  left in the
  directory.  The files were not
  collapsed as expected.  There are still files dated Jan 22 and Jan 27.

  A new optimization was started Feb. 11 and continues.
  This is a snapshot of the index directory.

  We have at least another million records to add.  Plus weekly updates of
  approximately 103K records.
  We are using the direct indexing method.
  java settings used - java -Xmx1024M -Xms1024M

  The files continue to grow so work is progressing.
  snapshot 2/21/08
  -bash-3.00$ ls -ltr
  total 205396680
  -rw-r--r--  1 flknud users 208 Jan 10 07:15 _2bm7.fnm
  -rw-r--r--  1 flknud users 22202159522 Jan 10 08:09 _2bm7.fdt
  -rw-r--r--  1 flknud users   679819760 Jan 10 08:09 _2bm7.fdx
  -rw-r--r--  1 flknud users 40184944027 Jan 16 18:16 _2bm7.frq
  -rw-r--r--  1 flknud users 47626230575 Jan 16 18:16 _2bm7.prx
  -rw-r--r--  1 flknud users  6155230704 Jan 16 18:16 _2bm7.tis
  -rw-r--r--  1 flknud users76704158 Jan 16 18:16 _2bm7.tii
  -rw-r--r--  1 flknud users   594842294 Jan 16 18:18 _2bm7.nrm
  -rw-r--r--  1 flknud users 208 Jan 22 08:57 _2bpa.fnm
  -rw-r--r--  1 flknud users10806426 Jan 22 08:57 _2bpa.fdt
  -rw-r--r--  1 flknud users  371200 Jan 22 08:57 _2bpa.fdx
  -rw-r--r--  1 flknud users21114330 Jan 22 08:57 _2bpa.frq
  -rw-r--r--  1 flknud users25683573 Jan 22 08:57 _2bpa.prx
  -rw-r--r--  1 flknud users 9225592 Jan 22 08:57 _2bpa.tis
  -rw-r--r--  1 flknud users  118660 Jan 22 08:57 _2bpa.tii
  -rw-r--r--  1 flknud users  324804 Jan 22 08:57 _2bpa.nrm
  -rw-r--r--  1 flknud users 198 Jan 22 09:00 _2bpl.fnm
  -rw-r--r--  1 flknud users 1335931 Jan 22 09:00 _2bpl.fdt
  -rw-r--r--  1 flknud users   36800 Jan 22 09:00 _2bpl.fdx
  -rw-r--r--  1 flknud users 2646708 Jan 22 09:00 _2bpl.frq
  -rw-r--r--  1 flknud users 3781824 Jan 22 09:00 _2bpl.prx
  -rw-r--r--  1 flknud users 1429176 Jan 22 09:00 _2bpl.tis
  -rw-r--r--  1 flknud users   18582 Jan 22 09:00 _2bpl.tii
  -rw-r--r--  1 flknud users   32204 Jan 22 09:00 _2bpl.nrm
  -rw-r--r--  1 flknud users 198 Jan 22 09:01 _2bpm.fnm
  -rw-r--r--  1 flknud users  121716 Jan 22 09:01 _2bpm.fdt
  -rw-r--r--  1 flknud users3200 Jan 22 09:01 _2bpm.fdx
  -rw-r--r--  1 flknud users  205961 Jan 22 09:01 _2bpm.frq
  -rw-r--r--  1 flknud users  302114 Jan 22 09:01 _2bpm.prx
  -rw-r--r--  1 flknud users  233641 Jan 22 09:01 _2bpm.tis
  -rw-r--r--  1 flknud users3036 Jan 22 09:01 _2bpm.tii
  -rw-r--r--  1 flknud users2804 Jan 22 09:01 _2bpm.nrm
  -rw-r--r--  1 flknud users 198 Jan 27 14:00 _2bpn.fnm
  -rw-r--r--  1 flknud users  227962 Jan 27 14:00 _2bpn.fdt
  -rw-r--r--  1 flknud users7200 Jan 27 14:00 _2bpn.fdx
  -rw-r--r--  1 flknud users  437798 Jan 27 14:00 _2bpn.frq
  -rw-r--r--  1 flknud users  593858 Jan 27 14:00

Optimization taking days/weeks

2008-02-28 Thread F Knudson

Optimization time on the Solr index has turned into days/weeks.
We are using Solr 1.2.
We use one box to build/optimize indexes. This index is copied to another
box for searching purposes.
We welcome suggestions/comments, etc.  We are a bit stumped on this.
Details are below.

Box details
Proc: 8 Dual Core 2.6GHz
Mem: 32 GB
OS: Red Hat Linux Enterprise 4
Kernel: 2.6.9-55.0.12.ELlargesmp

These are details from the index currently in use.  Search response time is
very acceptable (searchers are very happy).
Optimization time - 10433  (12/11/07)
index size - 229486464
# of records - 84960570 
index directory
flasher# ls -l
total 229486464
-rw-r--r--   1 flknud   staff22197926593 Dec 12 08:07 _2bl6.fdt
-rw-r--r--   1 flknud   staff679684560 Dec 12 08:20 _2bl6.fdx
-rw-r--r--   1 flknud   staff208 Dec 12 08:23 _2bl6.fnm
-rw-r--r--   1 flknud   staff40176405625 Dec 12 09:28 _2bl6.frq
-rw-r--r--   1 flknud   staff594723994 Dec 12 09:41 _2bl6.nrm
-rw-r--r--   1 flknud   staff47616340310 Dec 12 12:07 _2bl6.prx
-rw-r--r--   1 flknud   staff76708079 Dec 12 12:25 _2bl6.tii
-rw-r--r--   1 flknud   staff6154384415 Dec 12 12:42 _2bl6.tis
-rw-r--r--   1 flknud   staff 20 Dec 12 12:48 segments.gen
-rw-r--r--   1 flknud   staff 44 Dec 12 12:48 segments_2c64
--

current directory listing
indexed new records - Jan 22 and Jan 27
# of records - 85032470
optimization time - 558188

There were no out-of-memory errors.  There was 800961792KB left in the
directory.  The files were not collapsed (merged into a single segment) as
expected.  There are still files dated Jan 22 and Jan 27.

A new optimization was started Feb. 11 and continues.
This is a snapshot of the index directory.

We have at least another million records to add, plus weekly updates of
approximately 103K records.
We are using the direct indexing method.
java settings used - java -Xmx1024M -Xms1024M

The files continue to grow so work is progressing.
snapshot 2/21/08
-bash-3.00$ ls -ltr
total 205396680
-rw-r--r--  1 flknud users 208 Jan 10 07:15 _2bm7.fnm
-rw-r--r--  1 flknud users 22202159522 Jan 10 08:09 _2bm7.fdt
-rw-r--r--  1 flknud users   679819760 Jan 10 08:09 _2bm7.fdx
-rw-r--r--  1 flknud users 40184944027 Jan 16 18:16 _2bm7.frq
-rw-r--r--  1 flknud users 47626230575 Jan 16 18:16 _2bm7.prx
-rw-r--r--  1 flknud users  6155230704 Jan 16 18:16 _2bm7.tis
-rw-r--r--  1 flknud users76704158 Jan 16 18:16 _2bm7.tii
-rw-r--r--  1 flknud users   594842294 Jan 16 18:18 _2bm7.nrm
-rw-r--r--  1 flknud users 208 Jan 22 08:57 _2bpa.fnm
-rw-r--r--  1 flknud users10806426 Jan 22 08:57 _2bpa.fdt
-rw-r--r--  1 flknud users  371200 Jan 22 08:57 _2bpa.fdx
-rw-r--r--  1 flknud users21114330 Jan 22 08:57 _2bpa.frq
-rw-r--r--  1 flknud users25683573 Jan 22 08:57 _2bpa.prx
-rw-r--r--  1 flknud users 9225592 Jan 22 08:57 _2bpa.tis
-rw-r--r--  1 flknud users  118660 Jan 22 08:57 _2bpa.tii
-rw-r--r--  1 flknud users  324804 Jan 22 08:57 _2bpa.nrm
-rw-r--r--  1 flknud users 198 Jan 22 09:00 _2bpl.fnm
-rw-r--r--  1 flknud users 1335931 Jan 22 09:00 _2bpl.fdt
-rw-r--r--  1 flknud users   36800 Jan 22 09:00 _2bpl.fdx
-rw-r--r--  1 flknud users 2646708 Jan 22 09:00 _2bpl.frq
-rw-r--r--  1 flknud users 3781824 Jan 22 09:00 _2bpl.prx
-rw-r--r--  1 flknud users 1429176 Jan 22 09:00 _2bpl.tis
-rw-r--r--  1 flknud users   18582 Jan 22 09:00 _2bpl.tii
-rw-r--r--  1 flknud users   32204 Jan 22 09:00 _2bpl.nrm
-rw-r--r--  1 flknud users 198 Jan 22 09:01 _2bpm.fnm
-rw-r--r--  1 flknud users  121716 Jan 22 09:01 _2bpm.fdt
-rw-r--r--  1 flknud users3200 Jan 22 09:01 _2bpm.fdx
-rw-r--r--  1 flknud users  205961 Jan 22 09:01 _2bpm.frq
-rw-r--r--  1 flknud users  302114 Jan 22 09:01 _2bpm.prx
-rw-r--r--  1 flknud users  233641 Jan 22 09:01 _2bpm.tis
-rw-r--r--  1 flknud users3036 Jan 22 09:01 _2bpm.tii
-rw-r--r--  1 flknud users2804 Jan 22 09:01 _2bpm.nrm
-rw-r--r--  1 flknud users 198 Jan 27 14:00 _2bpn.fnm
-rw-r--r--  1 flknud users  227962 Jan 27 14:00 _2bpn.fdt
-rw-r--r--  1 flknud users7200 Jan 27 14:00 _2bpn.fdx
-rw-r--r--  1 flknud users  437798 Jan 27 14:00 _2bpn.frq
-rw-r--r--  1 flknud users  593858 Jan 27 14:00 _2bpn.prx
-rw-r--r--  1 flknud users  516031 Jan 27 14:00 _2bpn.tis
-rw-r--r--  1 flknud users6814 Jan 27 14:00 _2bpn.tii
-rw-r--r--  1 flknud users6304 Jan 27 14:00 _2bpn.nrm
-rw-r--r--  1 flknud users 198 Jan 27 14:01 _2bpo.fnm
-rw-r--r--  1 flknud users  231456 Jan 27 14:01 _2bpo.fdt
-rw-r--r--  1 flknud users7200 Jan 27 14:01 _2bpo.fdx
-rw-r--r--  1 flknud users  448401 Jan 27 14:01 _2bpo.frq
-rw-r--r--  1 flknud users  616557 Jan 27 14:01 _2bpo.prx
-rw-r--r--  1 flknud users  587697 Jan 27 14:01 _2bpo.tis
-rw-r--r--  1 flknud users7801 Jan 27 14:01 _2bpo.tii
-rw-r--r--  1 flknud users6304 Jan 27 14:01 _2bpo.nrm

Re: Letter-number transitions - can this be turned off

2007-10-02 Thread F Knudson

Thanks for your helpful suggestions.

I have considered other analyzers but WDF has great strengths.  I will
experiment with maintaining transitions and then consider modifying the
code.

F. Knudson


Mike Klaas wrote:
 
 On 30-Sep-07, at 12:47 PM, F Knudson wrote:
 

 Is there a flag to disable the letter-number transition in the
 solr.WordDelimiterFilterFactory?  We are indexing category codes and
 thesaurus codes for which this letter-number transition makes no sense.
 It is bloating the index (which is already large).
 
 Have you considered using a different analyzer?
 
 If you want to continue using WDF, you could make a quick change
 around line 320:
 
  if (splitOnCaseChange == 0 &&
      (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
    // ALPHA->ALPHA: always ignore if case isn't considered.

  } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
    // UPPER->LOWER: Don't split
  } else {

    ...
 
 by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and
 ignores it.
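
 For instance, the added branch (one more else-if slotted in ahead of the
 final else above) might look something like this; an untested sketch, with
 the constant names assumed from the WordDelimiterFilter source of that era:

   } else if (((lastType & ALPHA) != 0 && (type & DIGIT) != 0) ||
              ((lastType & DIGIT) != 0 && (type & ALPHA) != 0)) {
     // ALPHA<->NUMERIC (either direction): treat as no boundary, so the
     // letter-number transition no longer generates extra subword tokens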
 
 Another approach that I am using locally is to maintain the  
 transitions, but force tokens to be a minimum size (so r2d2 doesn't  
 tokenize to four tokens but arrrdeee does).
 
 There is a patch here: http://issues.apache.org/jira/browse/SOLR-293
 
 If you vote for it, I promise to get it in for 1.3 <g>
 
 -Mike
 
 
