HOWTO get a working copy of SOLR?
Dear list, this sounds stupid, but how do I get a fully working copy of SOLR? What I have tried so far:
- Started with LucidWorks SOLR. Installs fine, runs fine, but has an old Tika version and can only handle some PDFs.
- Changed to SOLR trunk. Installs fine, runs fine, but Luke 1.0.1 complains about "Unknown format version: -10". I guess because Luke 1.0.1 is compiled against lucene-core-3.0.1.jar but trunk has lucene-core-4.0-dev.jar??? Anyway, no luck with this version.
- Changed to SOLR branch_3x. Installs fine, runs fine, Luke works fine, but extraction with /update/extract (ExtractingRequestHandler) only returns the metadata, not the content. No luck with this version either.
Is there any fully working recent copy at all? Or a Luke that works with SOLR trunk? Regards, Bernd
Re: HOWTO get a working copy of SOLR?
Sixten Otto wrote:
> On Tue, Jun 15, 2010 at 12:58 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
>> - changed to SOLR branch_3x. Installs fine, runs fine, Luke works fine, but extraction with /update/extract (ExtractingRequestHandler) only returns the metadata, not the content.
> Sounds like https://issues.apache.org/jira/browse/SOLR-1902

Thanks for the hint.
- The trunk runs fine (at least on my system) but has no working Luke.
- The branch has a working Luke but doesn't extract the text.
What a pity. So is SOLR a serious development or just a playground or test case for Lucene? Why is Luke a separate tool and not combined / integrated with SOLR? Very strange... Bernd
Re: How to Debug Solr-Code in Eclipse?!
can nobody help me or want :D

As someone already said:
- install Eclipse
- add the Jetty Webapp plugin to Eclipse
- add an svn plugin to Eclipse
- check out the repository from trunk with svn
- change to the lucene dir and run "ant package"
- change to the solr dir and run "ant dist"
- set up a Jetty Webapp for solr with "Run configure..."
- start debugging :-)
If debugging below the solr level into the lucene level, just add the lucene src path to the debugging sources. Maybe you should read: http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse Regards, Bernd
Re: Different analyzers for different documents in different languages?
Actually, this is one of the biggest disadvantages of Solr for multilingual content. Solr is field based, which means you have to know the language _before_ you feed the content to a specific field and process the content for that field. This results in having separate fields for each language. E.g. for Europe this will be 24 to 26 languages for each title, keyword, description, ... I guess when they started with Lucene/Solr they never had multilingual content in mind. The alternative is to have a separate index for each language. There you also have to know the language of the content _before_ feeding it to the core; e.g. again for Europe you end up with 24 to 26 cores. Another option is to treat the multilingual fields (title, keywords, description, ...) as a subdocument: write a filter class as a subpipeline, use language and encoding detection as the first step in that pipeline, then do all other linguistic processing within that pipeline and return the processed content back to the field for further filtering and storing. Many solutions, but nothing out of the box :-) Bernd

On 22.09.2010 12:01, Andy wrote: I have documents that are in different languages. There's a field in the documents specifying what language it's in. Is it possible to index the documents such that, based on what language a document is in, a different analyzer will be used on that document? What is the normal way to handle documents in different languages? Thanks Andy
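For illustration, a minimal schema.xml sketch of the field-per-language layout described above (all type and field names are made up; only the English type is spelled out, text_de/text_fr would be analogous with their own stemmers):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <!-- one field per language, repeated for title, keywords, description, ... -->
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="title_de" type="text_de" indexed="true" stored="true"/>
  <field name="title_fr" type="text_fr" indexed="true" stored="true"/>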
Re: Migrating to Solr
Hi list, is this true — is there no downloadable copy of the documentprocessor available anywhere? Regards, Bernd

Bernd Fehling wrote: Was anyone able to get a copy of http://sesat.no/svn/sesat-documentprocessor/ ? Unfortunately it is offline. Would be pleased to get a copy. Regards, Bernd
Re: Query regarding solr custom sort order
Hi, I suggest using the following fieldType for your field:

  <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>

Regards Bernd

On 04.01.2012 14:40, umaswayam wrote: Hi, we want to sort our records based on a sequence like 1 2 3 4 5 6 7 8 9 10 11 12 13 14. I am using WebSphere Commerce to retrieve data using solr. When we customize the sort order/option in the wc-search.xml file we get the sort order 1 10 11 12 13 14 2 3 4 5 6 7 8 9. I guess the sort compares the first character of each value and, if they are equal, moves on to the next character, and so on, which results in wrong (lexicographic instead of numeric) sort output. Can anyone put some thoughts on this or help me out if I am doing something wrong here? Thanks in advance, Uma Shankar
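With that type applied to a field (hypothetically named "sequence" here), the sort should come back numeric rather than lexicographic — a sketch:

  <field name="sequence" type="sint" indexed="true" stored="true"/>

  http://localhost:8983/solr/select?q=*:*&sort=sequence+asc
  returns 1 2 3 4 5 6 7 8 9 10 11 12 13 14 (not 1 10 11 ... 2 3 ...)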
Re: Query regarding solr custom sort order
Hi Uma, I don't understand what you're looking for. Do you need to sort on fields of type double with precision 2, or what? In your example you were talking about 1 2 3 4 5 6 7 8 9 10 11 12 13 14. Regards, Bernd

On 06.01.2012 07:11, umaswayam wrote: Hi Bernd, the column which comes from the database is a string, which is populated by default. How do I convert it to double, as the format in the database is 1.00, 2.00, 3.00? So I need it to be converted to double. Thanks, Uma Shankar
exception while loading with DIH multi-threaded
Hi list, after changing DIH to multi-threaded (4 threads) I sometimes get an exception. This is not always the case, and I never had any problems single-threaded at all. I'm using Solr 3.5 but also tried branch_3x (3.6) and could see this with both versions. Don't know why this comes up after changing to multi-threaded; no other errors at all. This happens when LogUpdateProcessor finishes and is going to create the log message. What's wrong with this code?

  public String getName(int idx) {
    return (String)nvPairs.get(idx << 1);
  }

Any idea how to trace this down?

...
11.01.2012 11:25:52 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to /srv/www/solr/solr/solrserver/solr/./conf/dataimport.properties
11.01.2012 11:25:52 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
  at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
  at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
  at java.lang.String.valueOf(String.java:2826)
  at java.lang.StringBuilder.append(StringBuilder.java:115)
  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
  at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:78)
  at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
11.01.2012 11:26:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dataimport} status=0 QTime=0
11.01.2012 11:26:08 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
...

Regards Bernd
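For readers wondering about the quoted getName(): NamedList stores names and values interleaved in one flat list, so idx << 1 is just idx * 2. A small self-contained Java sketch of that layout (the concurrent-corruption explanation is an assumption based on the trace, not confirmed):

  import java.util.ArrayList;
  import java.util.List;

  public class FlatPairs {
      public static void main(String[] args) {
          // NamedList-style flat storage: [name0, value0, name1, value1, ...]
          List<Object> nvPairs = new ArrayList<Object>();
          nvPairs.add("status"); nvPairs.add("idle");   // pair 0
          nvPairs.add("docs");   nvPairs.add(42);       // pair 1
          // getName(idx) reads the even slot idx << 1
          String name = (String) nvPairs.get(1 << 1);   // prints "docs"
          System.out.println(name);
          // If another thread shifts entries mid-update, a value (e.g. an
          // ArrayList) can land on an even slot, and the (String) cast then
          // throws exactly the ClassCastException seen in the log above.
      }
  }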
Re: exception while loading with DIH multi-threaded
After browsing through the issues it looks like something belonging to https://issues.apache.org/jira/browse/SOLR-2694

On 11.01.2012 14:08, Bernd Fehling wrote: Hi list, after changing DIH to multi-threaded (4 threads) I sometimes get an exception. This is not always the case, and I never had any problems single-threaded at all. I'm using Solr 3.5 but also tried branch_3x (3.6) and could see this with both versions. Don't know why this comes up after changing to multi-threaded; no other errors at all. This happens when LogUpdateProcessor finishes and is going to create the log message. What's wrong with this code?

  public String getName(int idx) {
    return (String)nvPairs.get(idx << 1);
  }

Any idea how to trace this down?

...
11.01.2012 11:25:52 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to /srv/www/solr/solr/solrserver/solr/./conf/dataimport.properties
11.01.2012 11:25:52 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
  at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
  at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
  at java.lang.String.valueOf(String.java:2826)
  at java.lang.StringBuilder.append(StringBuilder.java:115)
  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
  at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:78)
  at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
11.01.2012 11:26:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dataimport} status=0 QTime=0
11.01.2012 11:26:08 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
...

Regards Bernd

--
*
Bernd Fehling                     Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                Universitätsstr. 25
Tel. +49 521 106-4060             33615 Bielefeld
Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
Re: exception while loading with DIH multi-threaded
Hi Mikhail, thanks for pointing me to the issue. Regards, Bernd

On 11.01.2012 21:47, Mikhail Khludnev wrote: FYI, it's https://issues.apache.org/jira/browse/SOLR-2804 — I'm trying to address it.

On Wed, Jan 11, 2012 at 5:49 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: After browsing through the issues it looks like something belonging to https://issues.apache.org/jira/browse/SOLR-2694

On 11.01.2012 14:08, Bernd Fehling wrote: Hi list, after changing DIH to multi-threaded (4 threads) I sometimes get an exception. This is not always the case, and I never had any problems single-threaded at all. I'm using Solr 3.5 but also tried branch_3x (3.6) and could see this with both versions. Don't know why this comes up after changing to multi-threaded; no other errors at all. This happens when LogUpdateProcessor finishes and is going to create the log message. What's wrong with this code?

  public String getName(int idx) {
    return (String)nvPairs.get(idx << 1);
  }

Any idea how to trace this down?

...
11.01.2012 11:25:52 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to /srv/www/solr/solr/solrserver/solr/./conf/dataimport.properties
11.01.2012 11:25:52 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
  at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
  at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
  at java.lang.String.valueOf(String.java:2826)
  at java.lang.StringBuilder.append(StringBuilder.java:115)
  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
  at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:78)
  at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
11.01.2012 11:26:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dataimport} status=0 QTime=0
11.01.2012 11:26:08 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
...

Regards Bernd
Re: Synonym configuration not working?
Yes and no. If you use the synonyms functionality out of the box you have to do it at index time. But if you use it at query time, like we do, you have to do some programming. We have connected a thesaurus which actually uses the synonyms functionality at query time. There are some pitfalls to take care of. Bernd

On 15.01.2012 07:07, Michael Lissner wrote: Just replying for others in the future. The answer to this is to do synonyms at index time, not at query time. Mike

On Fri 06 Jan 2012 02:35:23 PM PST, Michael Lissner wrote: I'm trying to set up some basic synonyms. The one I've been working on is: us, usa, united states. My understanding is that adding that to the synonym file will allow users to search for US and get back documents containing usa or united states. Ditto if a user puts in usa or united states. Unfortunately, with this in place, when I do a search I get the results for items that contain all three of the words — it's doing an AND of the synonyms rather than an OR. If I turn on debugging, this is indeed what I see (plus some stemming):

(+DisjunctionMaxQuery(((westCite:us westCite:usa westCite:unit) | (text:us text:usa text:unit) | (docketNumber:us docketNumber:usa docketNumber:unit) | ((status:us status:usa status:unit)^1.25) | (court:us court:usa court:unit) | (lexisCite:us lexisCite:usa lexisCite:unit) | ((caseNumber:us caseNumber:usa caseNumber:unit)^1.25) | ((caseName:us caseName:usa caseName:unit)^1.5/no_coord

Am I doing something wrong to cause this? My defaultOperator is set to AND, but I'd expect the synonym filter to understand that. Any help? Thanks, Mike
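For reference, a minimal index-time synonym setup as recommended above (a sketch; the field type name is made up):

  <!-- synonyms.txt contains e.g.:  us, usa, united states -->
  <fieldType name="text_syn" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <!-- no synonym filter at query time, so the query stays a single term -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>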
SolrException with branch_3x
On January 11th I checked out branch_3x with svn into Eclipse (Indigo). Compiled and tested it without problems. Today I updated my branch_3x from the repository. It compiles fine but now throws a SolrException on startup.

Jan 31, 2012 1:50:15 PM org.apache.solr.core.SolrCore initListeners
INFO: [] Added SolrEventListener for firstSearcher: org.apache.solr.core.QuerySenderListener{queries=[{q=*:*,start=0,rows=10,spellcheck.build=true}, {q=(text:(*:*).
Jan 31, 2012 2:00:10 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: QueryResponseWriter init failure
  at org.apache.solr.core.SolrCore.initWriters(SolrCore.java:1499)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:557)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:319)
...

It isn't able to init the QueryResponseWriter on startup :-( My config hasn't changed in 3 weeks, and I can't find any issue in CHANGES.txt belonging to this. And something else to mention, in SolrCore.java initWriters at lines 1491 to 1495:

  if (info.isDefault()) {
    defaultResponseWriter = writer;
    if (defaultResponseWriter != null)
      log.warn("Multiple default queryResponseWriter registered ignoring: " + old.getClass().getName());
  }

This will also log.warn for the first defaultResponseWriter. I would place "defaultResponseWriter = writer;" _AFTER_ the if/log.warn. Regards, Bernd
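The suggested reordering, spelled out as a sketch (it reuses the variables of the quoted method and is not a tested patch):

  if (info.isDefault()) {
    // warn only when a default writer was already set ...
    if (defaultResponseWriter != null)
      log.warn("Multiple default queryResponseWriter registered ignoring: "
               + old.getClass().getName());
    // ... and assign AFTER the check, so the first default never warns
    defaultResponseWriter = writer;
  }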
SOLVED: SolrException with branch_3x
After changing the lines as suggested below and recompiling, branch_3x now runs fine. The SolrException is gone. Regards, Bernd

On 31.01.2012 14:21, Bernd Fehling wrote: On January 11th I checked out branch_3x with svn into Eclipse (Indigo). Compiled and tested it without problems. Today I updated my branch_3x from the repository. It compiles fine but now throws a SolrException on startup.

Jan 31, 2012 1:50:15 PM org.apache.solr.core.SolrCore initListeners
INFO: [] Added SolrEventListener for firstSearcher: org.apache.solr.core.QuerySenderListener{queries=[{q=*:*,start=0,rows=10,spellcheck.build=true}, {q=(text:(*:*).
Jan 31, 2012 2:00:10 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: QueryResponseWriter init failure
  at org.apache.solr.core.SolrCore.initWriters(SolrCore.java:1499)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:557)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:319)
...

It isn't able to init the QueryResponseWriter on startup :-( My config hasn't changed in 3 weeks, and I can't find any issue in CHANGES.txt belonging to this. And something else to mention, in SolrCore.java initWriters at lines 1491 to 1495:

  if (info.isDefault()) {
    defaultResponseWriter = writer;
    if (defaultResponseWriter != null)
      log.warn("Multiple default queryResponseWriter registered ignoring: " + old.getClass().getName());
  }

This will also log.warn for the first defaultResponseWriter. I would place "defaultResponseWriter = writer;" _AFTER_ the if/log.warn. Regards, Bernd
Re: usage of /etc/jetty.xml when debugging Solr in Eclipse
Hi, from run-jetty-run issue #9: "... In the VM Arguments of your launch configuration set -Drjrxml=./jetty.xml. If jetty.xml is in the root of your project it will be used (you can also use a fully qualified path name). The UI port, context and WebApp dir are ignored, since you can define them in jetty.xml. Note: you still have to specify a valid WebApp dir because there are other checks that the plugin performs. ..." Or you can start solr with jetty as usual and then connect Eclipse to the running process. Regards

On 08.02.2012 12:24, jmlucjav wrote: Hi, I am following http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse in order to be able to debug Solr in Eclipse. I got it working fine. Now, I usually use ./etc/jetty.xml to set the logging configuration. When starting jetty in Eclipse I don't see any log files created, so I guessed jetty.xml is not being used. So I added it to the RunJetty Advanced configuration (Additional jetty.xml), but in that case something goes wrong, as I get a 'java.net.BindException: Address already in use: JVM_Bind' error, as if something is started twice. So my question is: can jetty.xml be used while debugging in Eclipse? If so, how? I would like to use the same configuration I use when I am just changing xml stuff in Solr and starting with 'java -jar start.jar'. Thanks in advance
Re: need to support bi-directional synonyms
Use a single comma-separated line: sprayer, washer
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Regards Bernd

On 23.02.2012 07:03, remi tassing wrote: Same question here... On Wednesday, February 22, 2012, geeky2 gee...@hotmail.com wrote: hello all, i need to support the following: if the user enters sprayer in the desc field, then they get results for BOTH sprayer and washer. And in the other direction: if the user enters washer in the desc field, then they get results for BOTH washer and sprayer. Would I set up my synonym file like this (assuming expand = true)? sprayer => washer, washer => sprayer. thank you, mark
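A minimal synonyms.txt sketch of that suggestion — with expand="true" in the filter config, a comma-separated group expands every member to all members, so it covers both directions without two "=>" rules:

  # bidirectional: sprayer matches washer and vice versa
  sprayer, washer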
Re: [SolrCloud] leaking file descriptors
What is netstat telling you about the connections on the servers? Any connections hanging in CLOSE_WAIT (passive close)? I saw this on my servers last week. I used a little program to spoof a local connection on those server ports and was able to trick the TCP stack into closing those connections. It also immediately released all open fds marked DEL and cleaned everything up without restarting. Regards Bernd

On 01.03.2012 11:36, Markus Jelsma wrote: Hi, yesterday we had an issue with too many open files, which was solved because a username was misspelled. But there is still a problem with open files. We cannot successfully index a few million documents from MapReduce to a 5-node Solr cloud cluster. One of the problems is that after a while ClassNotFoundErrors and other similar weirdness begin to appear. This will not solve itself if indexing is stopped. With lsof I found that Solr keeps roughly 9k files open 8 hours after indexing failed. Of the 9k, roughly 7.5k are deleted files that still have a file descriptor open for the tomcat6 user; these are all segment files:

/opt/solr/openindex_a/data/index.20120228101550/_34s.tvd
java 10049 tomcat6 DEL REG 9,0 515607
/opt/solr/openindex_a/data/index.20120228101550/_34s.tvx
java 10049 tomcat6 DEL REG 9,0 515504
/opt/solr/openindex_a/data/index.20120228101550/_34s.fdx
java 10049 tomcat6 DEL REG 9,0 515735
/opt/solr/openindex_a/data/index.20120228101550/_34s_nrm.cfs
java 10049 tomcat6 DEL REG 9,0 515595
/opt/solr/openindex_a/data/index.20120228101550/_34v_nrm.cfs
java 10049 tomcat6 DEL REG 9,0 515592
/opt/solr/openindex_a/data/index.20120228101550/_34v_0.tim
java 10049 tomcat6 DEL REG 9,0 515591
/opt/solr/openindex_a/data/index.20120228101550/_34v_0.prx
java 10049 tomcat6 DEL REG 9,0 515590
/opt/solr/openindex_a/data/index.20120228101550/_34v_0.frq
and many more

Did I misconfigure anything? This is a pretty standard (no changes to the IndexDefaults section) and recent Solr trunk revision. Is there a bug somewhere? Thanks, Markus
CLOSE_WAIT connections
Hi list, I have looked into the CLOSE_WAIT problem and created an issue with a patch to fix it. A search for CLOSE_WAIT shows that many Apache projects are hit by this problem. https://issues.apache.org/jira/browse/SOLR-3280 Can someone recheck the patch (it belongs to SnapPuller) and give the OK for release? The patch is against branch_3x (3.6). Regards Bernd
Re: [Announce] Solr 4.0 with RankingAlgorithm 1.4.1, NRT now supports both RankingAlgorithm and Lucene
Nothing against RankingAlgorithm and your work, which sounds great, but I think that YOUR "Solr 4.0" might confuse some Solr users and/or newbies. As far as I know, the next official release will be 3.6. So is your Solr 4.0 a trunk snapshot or what? If so, which revision number? Or have you done a fork and produced a stable Solr 4.0 of your own? Regards Bernd

On 29.03.2012 15:49, Nagendra Nagarajayya wrote: I am very excited to announce the availability of Solr 4.0 with RankingAlgorithm 1.4.1 (NRT support) (build 2012-03-19). The NRT implementation now supports both RankingAlgorithm and Lucene. RankingAlgorithm 1.4.1 has improved performance over the earlier release (1.4), supports the entire Lucene query syntax, +/- and/or boolean queries, and is compatible with the new Lucene 4.0 API. You can get more information about NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.0 with RankingAlgorithm 1.4.1 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org
Re: solr 3.5 taking long to index
There were some changes in solrconfig.xml between Solr 3.1 and Solr 3.5. Always read CHANGES.txt when switching to a new version. Also helpful is comparing both versions of solrconfig.xml from the examples. Are you sure you need a MaxPermSize of 5g? Use jvisualvm to see what you really need. The same goes for all other JAVA_OPTS.

On 11.04.2012 19:42, Rohit wrote: We recently migrated from Solr 3.1 to Solr 3.5; we have one master and one slave configured. The master has two cores: 1) Core1 - 44555972 documents, 2) Core2 - 29419244 documents. We commit every 5000 documents, but lately the commit is taking very long, 15 minutes plus in some cases. What could have caused this? I have checked the logs and the only warning I can see is:

WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.

Memory details:

export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"

Solr config:

  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>

What could be causing this, as everything was running fine a few days back? Regards, Rohit. Mobile: +91-9901768202. About Me: http://about.me/rohitg
Re: Lexical analysis tools for German language data
You might have a look at: http://www.basistech.com/lucene/

On 12.04.2012 11:52, Michael Ludwig wrote: Given an input of "Windjacke" (probably "wind jacket" in English), I'd like the code that prepares the data for the index (tokenizer etc.) to understand that this is a "Jacke" (jacket), so that a query for "Jacke" would include the "Windjacke" document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations for this kind of lexical analysis of German language data? (Or other languages, for that matter?) What are they, and where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms to refer to these techniques so one can search more successfully? Michael
Re: Lexical analysis tools for German language data
Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene/Solr. It worked excellently, but the price was much too high. Yes, they also have compound analysis for several languages, including German. Just configure your pipeline in solr and set up the processing pipeline in Rosette Language Processing (RLP), and that's it. Example from my very old schema.xml config:

  <fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                 rlpContext="solr/conf/rlp-index-context.xml"
                 postPartOfSpeech="false"
                 postLemma="true"
                 postStem="true"
                 postCompoundComponents="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                 rlpContext="solr/conf/rlp-query-context.xml"
                 postPartOfSpeech="false"
                 postLemma="true"
                 postCompoundComponents="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>

So you just point the tokenizer to RLP and have two RLP pipelines configured, one for indexing (rlp-index-context.xml) and one for querying (rlp-query-context.xml). Example from my rlp-index-context.xml config:

  <contextconfig>
    <properties>
      <property name="com.basistech.rex.optimize" value="false"/>
      <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
    </properties>
    <languageprocessors>
      <languageprocessor>Unicode Converter</languageprocessor>
      <languageprocessor>Language Identifier</languageprocessor>
      <languageprocessor>Encoding and Character Normalizer</languageprocessor>
      <languageprocessor>European Language Analyzer</languageprocessor>
      <!--
      <languageprocessor>Script Region Locator</languageprocessor>
      <languageprocessor>Japanese Language Analyzer</languageprocessor>
      <languageprocessor>Chinese Language Analyzer</languageprocessor>
      <languageprocessor>Korean Language Analyzer</languageprocessor>
      <languageprocessor>Sentence Breaker</languageprocessor>
      <languageprocessor>Word Breaker</languageprocessor>
      <languageprocessor>Arabic Language Analyzer</languageprocessor>
      <languageprocessor>Persian Language Analyzer</languageprocessor>
      <languageprocessor>Urdu Language Analyzer</languageprocessor>
      -->
      <languageprocessor>Stopword Locator</languageprocessor>
      <languageprocessor>Base Noun Phrase Locator</languageprocessor>
      <!--
      <languageprocessor>Statistical Entity Extractor</languageprocessor>
      -->
      <languageprocessor>Exact Match Entity Extractor</languageprocessor>
      <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
      <languageprocessor>Entity Redactor</languageprocessor>
      <languageprocessor>REXML Writer</languageprocessor>
    </languageprocessors>
  </contextconfig>

As you can see, I used the European Language Analyzer. Bernd

On 12.04.2012 12:58, Paul Libbrecht wrote: Bernd, can you please say a little more? I think it is OK for this list to contain some description of commercial solutions that satisfy a request formulated on the list. Is there any product at BASIS Tech that provides a compound analyzer with a big dictionary of decomposed compounds for German? If yes, for which domain? The Google search result (I wonder if it is politically correct not to have yours ;-)) shows me that there's an amount of work done in this direction (e.g. "Gärten" to match "Garten"), but being precise on this question would be more helpful! paul
HowTo getDefaultOperator with solr3.6?
I'm trying to get the default operator of a schema in Solr 3.6, but unfortunately everything is deprecated. The Solr 3.6 API says:

  getQueryParserDefaultOperator() - Method in class org.apache.solr.schema.IndexSchema
      Deprecated. use getSolrQueryParser().getDefaultOperator()
  getSolrQueryParser(String) - Method in class org.apache.solr.schema.IndexSchema
      Deprecated.

Now what? How can I continue if I start with: QueryParser.Operator operator = getReq().getSchema(). Regards Bernd
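A minimal sketch of the chain the quoted javadoc suggests (both calls are deprecated in 3.6 but still work; passing null for the default field is an assumption):

  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.solr.request.SolrQueryRequest;

  // chain the two deprecated calls; null means "use the schema default field"
  static QueryParser.Operator defaultOperator(SolrQueryRequest req) {
      return req.getSchema().getSolrQueryParser(null).getDefaultOperator();
  }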
Problems with edismax parser and solr3.6
I just looked through my Solr 3.6 logs and saw several 0-hit queries which were not seen with Solr 3.5. Tracing this down, it turned out that edismax no longer likes queries of the form ...q=(text:ide)... If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd
debugging junit test with eclipse
I have tried all the hints from the internet for debugging a junit test of Solr 3.6 under Eclipse, but didn't succeed. Eclipse and everything else is running, compiling, debugging with runjettyrun. The tests have no errors. Ant from the command line is also running with ivy, e.g.

  ant -Dtestmethod=testUserFields -Dtestcase=TestExtendedDismaxParser test-solr-core

But I can't get a single test running with junit from Eclipse so that I can jump into it for debugging. Any idea what's going wrong? Regards Bernd
Re: Multi-words synonyms matching
The request parameter fq is the Filter Query, generally used to restrict the superset of documents without influencing the score (more info: http://wiki.apache.org/solr/CommonQueryParameters#q ). For example:

  q="hotel de ville" ===> returns 100 documents
  q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===> returns 40 documents from the superset of 100 documents

hope this helps! - Jeevanandam

On 24-04-2012 3:08 pm, elisabeth benoit wrote: Hello, I'd like to resume this post. The only way I found to not split synonyms into words in synonyms.txt is to use the line

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>

in schema.xml, where tokenizerFactory="solr.KeywordTokenizerFactory" instructs SynonymFilterFactory not to break synonyms into words on white space when parsing the synonyms file. So now it works fine: mairie is mapped onto "hotel de ville", and when I send the request q="hotel de ville" (the quotes are mandatory to prevent the analyzer from splitting "hotel de ville" on white space), I get answers containing the word mairie. But when I use the fq parameter (fq=CATEGORY_ANALYZED:"hotel de ville"), it doesn't work!!! CATEGORY_ANALYZED is the same field type as the default search field. This means that when I send q="hotel de ville" and fq=CATEGORY_ANALYZED:"hotel de ville", solr uses the same analyzer, the one with the SynonymFilterFactory line above. Anyone has a clue what is different between q analysis behaviour and fq analysis behaviour? Thanks a lot, Elisabeth

2012/4/12 elisabeth benoit elisaelisael...@gmail.com: oh, that's right. thanks a lot, Elisabeth

2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com: Elisabeth - as you described, the below mapping might suit your need:

  mairie => hotel de ville, mairie

mairie gets expanded to "hotel de ville" and mairie at index time, so both mairie and "hotel de ville" are searchable on the document. However, the white space tokenizer splitting at query time will still be a problem, as described by Markus. --Jeevanandam

On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
> Have you tried the => mapping instead? Something like hotel de ville => mairie might work for you.
Yes, thanks, I've tried it, but from what I understand it doesn't solve my problem, since it means "hotel de ville" will be replaced by mairie at index time (I use synonyms only at index time). So when a user asks for "hôtel de ville", it won't match. In fact, at index time I have mairie in my data, but I want the user to be able to request mairie or "hôtel de ville" and get mairie as an answer, and not get mairie as an answer when requesting hôtel.
> To map `mairie` to `hotel de ville` as a single token you must escape your white space: mairie, hotel\ de\ ville — this results in a problem if your tokenizer splits on white space at query time.
OK, I guess this means I have a problem. No simple solution, since at query time my tokenizer does split on white space. I guess my problem is more or less one of the problems discussed in http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215 Thanks a lot for your answers, Elisabeth

2012/4/10 Erick Erickson erickerick...@gmail.com: Have you tried the => mapping instead? Something like hotel de ville => mairie might work for you. Best, Erick

On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit elisaelisael...@gmail.com wrote: Hello, I've read several posts on this issue, but can't find a real solution to my multi-word synonym matching problem. I have in my synonyms.txt an entry like

  mairie, hotel de ville

and my index-time analyzer is configured as follows for synonyms:

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

The problem I have is that now mairie matches hotel, and I would only want mairie to match "hotel de ville" and mairie. When I look into the analyzer, I see that mairie is mapped onto hotel, and the words de and ville are added in second and third position. To change that, I tried

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>

(as I read in one post), and I can now see in the analyzer that mairie is mapped to "hotel de ville", but now when I have the query "hotel de ville", it doesn't match mairie at all. Anyone has a clue what I'm doing wrong? I'm using Solr 3.4. Thanks, Elisabeth
Re: Out Of Memory =( Too many cores on one server?
I guess you should give the JVM more memory. When starting to find a good value for -Xmx I oversized and set it to -Xmx20G and -Xms20G. Then I monitored the system and saw that the JVM uses between 5G and 10G (Java 7 with the G1 GC). Now it is finally set to -Xmx11G and -Xms11G for my system with 1 core and 38 million docs. But JVM memory depends pretty much on the number of fields in schema.xml and the fieldCache (sortable fields). Regards Bernd

On 16.11.2012 09:29, stockii wrote: Hello, if my server runs for a while I get some OOM problems. I think the problem is that I am running too many cores on one server with too many documents. This is my server concept: 14 cores. 1 with 30 million docs, 1 with 22 million docs, 1 growing at 25 million docs, 1 with 67 million docs, and the other cores are under 1 million docs. All these cores run fine in one jetty, searching is very fast and we are satisfied with this. Yesterday we got an OOM. Do you think that we should move the big cores into another virtual instance of the server, so that the JVMs do not share the memory and go OOM? Starting with: MEMORY_OPTIONS="-Xmx6g -Xms2G -Xmn1G"
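The probe-then-shrink approach described above, as a JAVA_OPTS sketch (values taken from the message; the G1 flag is the standard switch for the Java 7 collector mentioned):

  # step 1: oversize and monitor the real heap usage (e.g. with jvisualvm)
  JAVA_OPTS="-Xms20g -Xmx20g -XX:+UseG1GC"
  # step 2: after observing 5-10G of actual usage, pin a safe final size
  JAVA_OPTS="-Xms11g -Xmx11g -XX:+UseG1GC"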
Re: error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar
I think there is already a BETA available: http://luke.googlecode.com/svn/trunk/ Changes in the unreleased version: * Update to 4.0.0_BETA. * Issue 22: term vectors could not be accessed if a field was not stored. Fixed also several other wrong assumptions about field flags. You might try that one. Regards Bernd

On 16.11.2012 17:16, Miguel Ángel Martín wrote: hi all: I can't open an index created with Solr 4.0 with Luke version lukeall-4.0.0-ALPHA.jar. I get the error:

Format version is not supported (resource: NIOFSIndexInput(path="/Users/desa/data/index/_2.tvx")): 1 (needs to be between 0 and 0)
  at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:148)
  at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
  at org.apache.lucene.codecs.lucene40.Lucene40TermVectorsReader.<init>(Lucene40TermVectorsReader.java:108)
  at org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat.vectorsReader(Lucene40TermVectorsFormat.java:107)
  at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:118)
  at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:55)
  at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
  at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
  at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
  at org.getopt.luke.Luke.openIndex(Luke.java:967)
  at org.getopt.luke.Luke.openOk(Luke.java:696)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at thinlet.Thinlet.invokeImpl(Thinlet.java:4579)
  at thinlet.Thinlet.invoke(Thinlet.java:4546)
  at thinlet.Thinlet.handleMouseEvent(Thinlet.java:3937)
  at thinlet.Thinlet.processEvent(Thinlet.java:2917)
  at java.awt.Component.dispatchEventImpl(Component.java:4744)
  at java.awt.Container.dispatchEventImpl(Container.java:2141)
  at java.awt.Component.dispatchEvent(Component.java:4572)
  at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4619)
  at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4280)
  at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4210)
  at java.awt.Container.dispatchEventImpl(Container.java:2127)
  at java.awt.Window.dispatchEventImpl(Window.java:2489)
  at java.awt.Component.dispatchEvent(Component.java:4572)
  at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:704)
  at java.awt.EventQueue.access$400(EventQueue.java:82)
  at java.awt.EventQueue$2.run(EventQueue.java:663)
  at java.awt.EventQueue$2.run(EventQueue.java:661)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
  at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
  at java.awt.EventQueue$3.run(EventQueue.java:677)
  at java.awt.EventQueue$3.run(EventQueue.java:675)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
  at java.awt.EventQueue.dispatchEvent(EventQueue.java:674)
  at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:296)
  at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211)
  at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201)
  at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196)
  at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188)
  at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)

Any ideas? I've created another index with Lucene 4.0 and this Luke opens that index fine. Thanks in advance

--
*
Bernd Fehling                     Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25               and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
Re: error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar
I just downloaded, compiled and opened an optimized Solr 4.0 index read-only without problems. Could browse through the docs, search with different analyzers, ... Looks good.

On 19.11.2012 08:49, Toke Eskildsen wrote: On Mon, 2012-11-19 at 08:10 +0100, Bernd Fehling wrote:
> I think there is already a BETA available: http://luke.googlecode.com/svn/trunk/ You might try that one.
That doesn't work either for Lucene 4.0.0 indexes, same for source trunk. I did have some luck with downloading the source and changing the dependencies to Lucene 4.0.0 final (4 or 5 JARs, AFAIR). It threw a non-fatal exception upon index open, something about subReaders not being accessible through the method it used (sorry for being vague, it was on my home machine some days ago), so I'm guessing that not all functionality works. It was possible to inspect some documents, and that was what I needed at the time.
Re: Multi word synonyms
There are also other solutions. Multi-word synonym filter (synonym expansion): https://issues.apache.org/jira/browse/LUCENE-4499 Since Solr 3.4 I have my own solution, which might become obsolete if LUCENE-4499 makes it into a released version: http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html

On 29.11.2012 13:44, O. Klein wrote: Found an article about the issue of multi-word synonyms: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ . Not sure it's the solution I'm looking for, but it may be for someone else.
DefaultSolrParams ?
Dear list, after going from 3.6 to 4.0 I see exceptions in my logs. It turned out that somehow the q parameter was empty. With 3.6 the q.alt in solrconfig.xml worked as a fallback, but now with 4.0 I get exceptions. I use it like this:

  SolrParams params = req.getParams();
  String q = params.get(CommonParams.Q).trim();

The exception comes from the second line if q is empty. I can see q.alt=*:* in my defaults within params. So why is it not picking up q.alt if q is empty? Regards Bernd
Re: DefaultSolrParams ?
Hi Hoss, my config has definitely not changed, and it worked with 3.6 and 3.6.1. Yes, I have a custom plugin, and with 3.6, if q was empty, it automatically picked up q.alt from solrconfig.xml. This was all done with params.get(). With 4.x this is gone, due to some changes in DefaultSolrParams(?). What is now the method to get q from params and have an automatic fallback to q.alt? Bernd

: I use it like this:
: SolrParams params = req.getParams();
: String q = params.get(CommonParams.Q).trim();
:
: The exception is from the second line if q is empty.
: I can see q.alt=*:* in my defaults within params.
:
: So why is it not picking up q.alt if q is empty?

You're talking about some sort of custom solr plugin that you have, correct? When you access a SolrParams object, there is nothing magic about q and q.alt -- params.get() will only return the value specified for the param name you ask about. The logic for using q.alt (aka DisMaxParams.ALTQ) if q doesn't exist in the params (or is blank) has always been a specific feature of the DisMaxQParser. So if you are suddenly getting an NPE when q is missing, perhaps the problem is that in your old configs there was a default q containing the empty string, and now that's gone? -Hoss
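Following Hoss's explanation, a custom plugin has to replicate the dismax fallback itself; a minimal sketch (assuming the plugin has the SolrQueryRequest in scope):

  import org.apache.solr.common.params.CommonParams;
  import org.apache.solr.common.params.DisMaxParams;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;

  // emulate the DisMaxQParser behaviour: if q is missing or blank,
  // fall back to q.alt (DisMaxParams.ALTQ) from the request defaults
  static String queryOrAlt(SolrQueryRequest req) {
      SolrParams params = req.getParams();
      String q = params.get(CommonParams.Q);
      if (q == null || q.trim().length() == 0) {
          q = params.get(DisMaxParams.ALTQ);
      }
      return q;
  }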
Re: OutOfMemoryError | While Faceting Query
Hi Uwe, sorting needs to be well prepared. A first rough check is the fieldCache: you can see it with SolrAdmin Stats, and the insanity_count there should be 0 (zero). Only sort on fields which are prepared for sorting and where sorting makes sense. Do faceting only on fields where it makes sense; I've seen systems faceting on id, which is a no-go and doesn't make sense. Such queries pull a lot of data from the index into memory, which can lead to OOME. How to figure out what is killing your system? There is no general rule, but what you can do is:
- Start your test system and make sure no one else is using it.
- Start a monitor for the JVM running SOLR (e.g. jvisualvm).
- Use your search frontend and do searches, sorting and faceting in any combination possible, and watch whether the heap memory makes big jumps.
- Analyze your search log files and look for searches which have a very high QTime. Repeat the searches with high QTime and see if you get insanity counts or heap memory jumps in the JVM.
Regards Bernd

On 06.12.2012 23:27, uwe72 wrote: Hi there, since I use a lot of sorting and faceting I am getting an OutOfMemoryError very often. I have around 6 million documents, an index size of around 18GB, and use tomcat with a 1.8 GB max heap size. What can I do? What heap size is recommended in our case? Can I do other things in order to prevent an OutOfMemoryError while using a lot of facets? Urgent, please help. Thanks, Uwe
Re: jconsole over jmx - should threads be visible?
Hi Shawn, actually I use munin for monitoring but just checked with jvisualvm, which also works fine for remote monitoring. You might try the following: http://www.codefactorycr.com/java-visualvm-to-profile-a-remote-server.html You have to:
- generate a policy file on the server to be monitored
- start jstatd on the server to be monitored
- have JMX enabled for jetty or tomcat or ...
- and you should have JMX protected with a password.

Password protection for jetty:
  $JETTY_HOME/etc/jmxremote.access
  $JETTY_HOME/etc/jmxremote.password

jmxremote.access looks like:
  monitorRole readonly
  controlRole readwrite

jmxremote.password looks like:
  monitorRole solr4monitor
  controlRole solr4control

If everything is set correctly, start jvisualvm, right-click on Remote and add a remote host. Enter the IP address into Hostname and click OK. Now you have the connection to jstatd on the remote host, which will show jstatd and start.jar of the remote host. Then click on start.jar and you will be asked for username and password. Enter controlRole as username and solr4control as password. Regards Bernd

On 18.12.2012 18:21, Shawn Heisey wrote: If I connect jconsole to a remote Solr installation (or any app) using JMX, all the graphs are populated except 'threads' ... is this expected, or have I done something wrong? I can't seem to locate the answer with google. Thanks, Shawn
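For completeness, the standard JVM flags that enable the password-protected JMX described above (the port number is an arbitrary example; file locations match the jetty layout mentioned):

  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port=18983
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=true
  -Dcom.sun.management.jmxremote.access.file=$JETTY_HOME/etc/jmxremote.access
  -Dcom.sun.management.jmxremote.password.file=$JETTY_HOME/etc/jmxremote.password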
thanks for solr 4.1
Now this must be said: thanks for Solr 4.1 (and Lucene 4.1)! Great improvements compared to 4.0. After building the first 4.1 index I thought the index was broken, but had no error messages anywhere. Why did I think it was damaged? The index size went down from 167 GB (Solr 4.0) to 115 GB (Solr 4.1)!!! Will now move the new 4.1 index to the testing stage, and after it passes all tests it goes online. Can't wait to see the new stats. Regards, Bernd
Solr4.1 changing result order FIFO to LIFO
Hi list, I noticed that the result order is FIFO if documents have the same score. I think this is because documents which are indexed later get a higher internal document ID, and the output for documents with the same score starts with the lowest internal document ID and rises. Is this right so far? I would prefer LIFO output: documents with the same score but indexed later are newer (at least for my data) and should be displayed first. Sure, I could use sorting, but sorting is always time consuming, whereas LIFO output would just start with the highest internal document ID for documents with the same score. Is there anything like this already available? If not, any hint where to look (Lucene or Solr)? Regards Bernd
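One possibly relevant knob, untested for this exact case and therefore only an assumption: Solr accepts the pseudo-field _docid_ in the sort parameter, which orders by internal document ID without a fieldCache entry, so as a secondary sort it might give the LIFO behaviour asked for:

  sort=score desc,_docid_ desc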
expert question about SolrReplication
A question to the experts: why is the replicated index copied from its temporary location (index.x) to the real index directory and NOT moved? Copying over 100s of gigs takes some time; moving is just changing the file system link. Also, instead of first deleting the old index, why not:
- move the file links of the old index to index.x.old
- move the file links of the new index to index
- and finally, after the new searcher is up, delete index.x.old
Any answers? Regards Bernd
Re: expert question about SolrReplication
On 02.02.2013 03:48, Yonik Seeley wrote:
> On Fri, Feb 1, 2013 at 4:13 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
>> A question to the experts, why is the replicated index copied from its temporary location (index.x) to the real index directory and NOT moved?
> The intent is certainly to move and not copy (provided the Directory supports it). See StandardDirectoryFactory.move()

Because I run Solr/Lucene on Linux I suppose it should really move, but I will step through it with a debugger and see what happens.

>> Copying over 100s of gigs takes some time, moving is just changing the file system link. Also, instead of first deleting the old index, why not - moving the file links of old index to index.x.old
> You can't do this in Windows?

Solr/Lucene is optimized for Windows??? Who is on the MS payroll?

>> - moving the file links of new index to index - and finally after new searcher is up, deleting index.x.old
> -Yonik http://lucidworks.com
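For illustration, a rename-first move with a copy fallback as a plain Java 7 sketch — this is not the actual StandardDirectoryFactory.move() code, just the pattern under discussion:

  import java.io.IOException;
  import java.nio.file.*;

  public class MoveSketch {
      // On the same filesystem an atomic move only rewires the directory
      // entry; only when that is unsupported (e.g. across filesystems)
      // do we pay for a full per-file copy plus delete.
      static void move(Path src, Path dest) throws IOException {
          try {
              Files.move(src, dest, StandardCopyOption.ATOMIC_MOVE);
          } catch (AtomicMoveNotSupportedException e) {
              Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
              Files.delete(src);
          }
      }
  }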
replication problems with solr4.1
Hi list, after upgrading from Solr 4.0 to Solr 4.1 and running it for two weeks, it turns out that replication has problems and unpredictable results. My installation is a single index, 41 mio. docs / 115 GB index size / 1 master / 3 slaves.
- the master builds a new index from scratch once a week
- a replication is started manually with the Solr admin GUI
What I see is one of these cases:
- after a replication, a new searcher is opened on an index.xxx directory, the old data/index/ directory is never deleted, and besides the file replication.properties there is also a file index.properties
OR
- the replication takes place and everything looks fine, but when opening the admin GUI the statistics report:

  Last Modified: a day ago
  Num Docs: 42262349
  Max Doc: 42262349
  Deleted Docs: 0
  Version: 45174
  Segment Count: 1

          Version        Gen   Size
  Master: 1360483635404  112   116.5 GB
  Slave:  1360483806741  113   116.5 GB

In the first case, why is the replication doing that??? It is an offline slave — no search activity, it's just there for backup! In the second case, why are the version and generation different right after a full replication? Any thoughts on this? - Bernd
Re: replication problems with solr4.1
Now this is strange: the index generation and index version change with replication. E.g. the master has index generation 118, index version 136059533234, and the slave has index generation 118, index version 136059533234 — both the same. Now add one doc to the master with a commit: the master has index generation 119, index version 1360595446556. Next, replicate master to slave. The result is: master has index generation 119, index version 1360595446556; slave has index generation 120, index version 1360595564333. I have not seen this before. I thought replication just takes over the index from master to slave, more like a sync?

On 11.02.2013 09:29, Bernd Fehling wrote: Hi list, after upgrading from Solr 4.0 to Solr 4.1 and running it for two weeks, it turns out that replication has problems and unpredictable results. My installation is a single index, 41 mio. docs / 115 GB index size / 1 master / 3 slaves.
- the master builds a new index from scratch once a week
- a replication is started manually with the Solr admin GUI
What I see is one of these cases:
- after a replication, a new searcher is opened on an index.xxx directory, the old data/index/ directory is never deleted, and besides the file replication.properties there is also a file index.properties
OR
- the replication takes place and everything looks fine, but when opening the admin GUI the statistics report:

  Last Modified: a day ago
  Num Docs: 42262349
  Max Doc: 42262349
  Deleted Docs: 0
  Version: 45174
  Segment Count: 1

          Version        Gen   Size
  Master: 1360483635404  112   116.5 GB
  Slave:  1360483806741  113   116.5 GB

In the first case, why is the replication doing that??? It is an offline slave — no search activity, it's just there for backup! In the second case, why are the version and generation different right after a full replication? Any thoughts on this? - Bernd

--
*
Bernd Fehling                     Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25               and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
Re: replication problems with solr4.1
OK, then index generation and index version are useless when it comes to verifying that master and slave indexes are in sync. What else is possible? The strange thing is: if the master is 2 or more generations ahead of the slave, then it works! With your logic the slave must _always_ be one generation ahead of the master, because the slave replicates from the master and then does an additional commit to recognize the changes on the slave. This implies that the slave acts as follows:
- if the master is one generation ahead, do an additional commit
- if the master is 2 or more generations ahead, do _no_ commit
OR
- if the master is 2 or more generations ahead, do a commit but don't change the generation and version of the index
Can this be true? I would say not really. Regards Bernd

On 13.02.2013 20:38, Amit Nithian wrote: Okay, so then that should explain the generation difference of 1 between the master and slave. On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote: On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote: doesn't it do a commit to force solr to recognize the changes? — yes. - Mark
Re: Slaves always replicate entire index & index versions
Maybe the info about the index version is pulled from the repeater's data/replication.properties file and the content of that file is wrong. I had something similar, and the only solution for me was deleting the replication.properties file. But no guarantee on this. Actually, replication is pretty messed up in Solr 4.1. I have seen about 6 or 7 erroneous combinations with replication that led to some kind of problem. My problem is I can't reproduce it continuously so that I could use a debugger :-( A positive point is: if something goes wrong, it goes wrong on all slaves. Some kind of continuity :-) While writing this I just found a new combination:
- master had a clean index (everything committed and optimized) and was successfully replicated to all slaves
- master had a few docs added and committed
- master was replicated to all slaves
- all slaves have the same generation and version as the master, but:
- all slaves now have no index directory anymore. They only have an index.x directory and an additional index.properties file.
I already knew that something would go wrong when I started the replication and saw that the slaves pulled the whole index (again) from the master and not only the files with the added docs. Under these circumstances I would not even dream of using SolrCloud. Regards Bernd

On 27.02.2013 08:50, raulgrande83 wrote: I'm now having a different problem. In my master-repeater-2slaves architecture I have these generations/versions: Master: 29147, Repeater: 29147, Slaves: 29037. When I go to the slaves' logs they show "Slave in sync with master". That is apparently because if I request http://localhost:17045/solr/replication?command=indexversion (my repeater's replication URL), the response is:

  <long name="generation">29037</long>

Why is this URL returning an old index version? Any solutions to this?
Re: how often do you boys restart your tomcat?
Until now I used Jetty, and two weeks was the longest uptime before OOM. I just switched to tomcat6 and will see how that one behaves, but I think it's not a problem of the servlet container. Solr is pretty unstable when having a huge index. Actually this can't be blamed directly on Solr; it is a problem of Lucene and its fieldCache. Somehow, during two weeks of runtime with searching and replication, the fieldCache gets doubled until OOM. Currently there is no other solution to this than restarting your tomcat or jetty regularly :-(

On 27.07.2011 03:42, Bing Yu wrote: I find that if I do not restart the master's tomcat for some days, the load average will keep rising to a high level, solr becomes slow and unstable, so I added a crontab to restart the tomcat every day. Do you boys restart your tomcat? And is there any way to avoid restarting tomcat?
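If you want to watch the fieldCache grow between restarts, the 3.x admin stats page exposes the Lucene FieldCache entries (URL assumes the default example setup):

  curl "http://localhost:8983/solr/admin/stats.jsp" | grep -i fieldcache

The fieldCache section lists entries_count and an insanity_count from the FieldCacheSanityChecker; an insanity_count above 0 can mean the same field is cached more than once (e.g. per reader after replication), which would match the doubling described above.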
Re: how often do you boys restart your tomcat?
It is definitely Lucene's fieldCache making the trouble. Restart your solr and monitor it with jvisualvm, especially the OldGen heap. When it gets filled to 100 percent, use jmap to dump the heap of your system. Then use the Eclipse Memory Analyzer http://www.eclipse.org/mat/ and open the heap dump. You will see a pie chart and can easily identify the largest consumer of your heap space.

On 27.07.2011 09:02, Paul Libbrecht wrote: On curriki.org, our solr's Tomcat saturates memory after 2-4 weeks. I am still investigating if I am accumulating something or something else is. To check it, I am running a query-all returning the num results every minute to measure the time it takes. It's generally when it meets a big GC that gives a timeout that I start to worry. Memory then starts to be hogged but things get back to normal as soon as the GC is out. I had other tomcat servers with very long uptimes (more than 6 months) so I do not think tomcat is guilty. Currently I can only show the free memory of the system and what's in solr-stats, but I do not know what to look at really... paul

On 27 Jul 2011, at 03:42, Bing Yu wrote: I find that if I do not restart the master's tomcat for some days, the load average will keep rising to a high level [...]
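For reference, the commands involved (the pid is the Tomcat/Jetty process id):

  jmap -dump:format=b,file=/tmp/solr-heap.hprof <pid>
  jstat -gcutil <pid> 5000

jstat prints the heap generation usage every 5 seconds, which is handy for watching OldGen fill up before taking the dump; the .hprof file can then be opened directly in the Eclipse Memory Analyzer.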
segment.gen file is not replicated
Dear list, is there a deeper logic behind why the segment.gen file is not replicated with solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: Solr 3.3 crashes after ~18 hours?
Any JAVA_OPTS set? Do not use the -XX:+OptimizeStringConcat or -XX:+AggressiveOpts flags.

On 02.08.2011 12:01, alexander sulz wrote: Hello folks, I'm using the latest stable Solr release - 3.3 - and I encounter a strange phenomenon with it. After about 19 hours it just crashes, but I can't find anything in the logs: no exceptions, no warnings, no suspicious info entries. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
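For comparison, a conservative set of options that avoids the two flags above might look like this (heap sizes are placeholders; boolean HotSpot flags can also be switched off explicitly with the -XX:- form):

  JAVA_OPTS="-server -Xms2g -Xmx2g -XX:-OptimizeStringConcat -XX:-AggressiveOpts -verbose:gc -Xloggc:/var/log/solr-gc.log"

The GC log at least gives you something to look at after a crash that leaves nothing in the Solr logs.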
performance crossover between single index and sharding
Is there any knowledge on this list about the performance crossover between a single index and sharding, and when to change from a single index to sharding? E.g. if the index size is larger than 150GB and the number of docs is more than 25 mio., then it is better to change from a single index to sharding and have two shards. Or something like this... Sure, solr might even handle 50 mio. docs, but performance is going down, and a sharded system with distributed search will be faster than a single index, or not? Is a single index always faster than sharding? Regards Bernd
Re: performance crossover between single index and sharding
On 02.08.2011 21:00, Shawn Heisey wrote: ... I did try some early tests with a single large index. Performance was pretty decent once it got warmed up, but I was worried about how it would perform under a heavy load, and how it would cope with frequent updates. I never really got very far with testing those fears, because the full rebuild time was unacceptable - at least 8 hours. The source database can keep up with six DIH instances reindexing at once, which completes much quicker than a single machine grabbing the entire database. I may increase the number of shards after I remove virtualization, but I'll need to fix a few limitations in my build system. ...

At first, thanks a lot for all the answers, and here is my setup. I know that it is very difficult to give specific recommendations about this. Because of changing from FAST Search to Solr I can state that Solr performs very well, if not excellently. To show that I compare apples and oranges, here is my previous FAST Search setup:
- one master server (controlling, logging, search dispatcher)
- six index servers (4.25 mio. docs per server, 5 slices per index) (searching and indexing at the same time, indexing once per week during the weekend)
- each server has 4GB RAM, all servers are physical, on separate machines
- RAM usage controlled by the processes
- total of 25.5 mio. docs (mainly metadata) from 1500 databases worldwide
- index size is about 67GB per indexer -- about 402GB total
- about 3 qps at peak times
- with an average search time of 0.05 seconds at peak times

And here is now my current Solr setup:
- one master server (indexing only)
- two slave servers (search only), but only one is online, the second is fallback
- each server has 32GB RAM, all servers are virtual (master on a separate physical machine, both slaves together on a physical machine)
- RAM usage is currently 20GB for the java heap
- total of 31 mio. docs (all metadata) from 2000 databases worldwide
- index size is 156GB total
- the search handler statistics report 0.6 average requests per second
- average time per request 39.5 (is that seconds?)
- building the index from scratch takes about 20 hours

The good thing is I have the ability to compare a commercial product and enterprise system to open source. I started with my simple Solr setup because of KISS (keep it simple, stupid). Actually it is doing excellently as a single index on a single virtual server. But the average time per request should be reduced now, that's why I started this discussion. While searches with a smaller Solr index size (3 mio. docs) showed that it can stand with FAST Search, it now shows that it's time to go with sharding. I think we are already far beyond the point of the search performance crossover.

What I hope to get with sharding:
- reduce the time for building the index
- reduce the average time per request

What I fear with sharding:
- I currently have master/slave, do I then have e.g. 3 masters and 3 slaves?
- the query changes because of sharding (is there a search distributor?)
- how to distribute the content to the indexers with DIH on 3 servers?
- anything else to think about while changing to sharding?

Conclusion:
- Solr can handle much more than 30 mio. docs of metadata in a single index if the java heap size is large enough. Have an eye on Lucene's fieldCache and sorted fields, especially title (string) fields.
- The crossover in my case is somewhere between 3 mio. and 10 mio. docs per index for Solr (compared to FAST Search). FAST recommends about 3 to 6 mio. docs per 4GB RAM server for their system.
Anyone able to reduce my fears about sharding? Thanks again for all your answers. Regards Bernd -- * BASE - Bielefeld Academic Search Engine - www.base-search.net *
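On the query side of sharding, no extra dispatcher component is needed; any Solr instance can merge results from all shards via the shards parameter (hostnames below are placeholders):

  http://slave1:8983/solr/select?q=some+query&shards=idx1:8983/solr,idx2:8983/solr,idx3:8983/solr

The instance receiving the request queries all listed shards and merges the results, so an existing slave can play the role the FAST search dispatcher had.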
Re: performance crossover between single index and sharding
Hi Shawn, the 0.05 seconds search time at peak times (3 qps) is my target for Solr. The numbers for solr are from Solr's statistics report page. So 39.5 seconds average per request is definitely too long and I have to change to sharding. For the FAST system the numbers for the search dispatcher are:
0.042 sec elapsed per normal search, on avg.
0.053 sec average uncached normal search time (last 100 queries)
99.898% of searches using <= 1 sec
99.999% of searches using <= 3 sec
0.000% of all requests timed out
22454567.577 sec time up (that is 259 days)
Is there a report page for those numbers for Solr? About the RAM: the 32GB RAM are physical for each VM and the 20GB RAM are -Xmx for Java. Yesterday I noticed that we are running out of heap during replication, so I have to increase -Xmx to about 22g. The reported 0.6 average requests per second seems right to me because the Solr system isn't under full load yet. The FAST system is still taking most of the load. I plan to switch completely to Solr after sharding is up and running stable. So there will be an additional 3 qps for Solr at peak times. I don't know if a controlling master like FAST has makes any sense for Solr. The small VMs with heartbeat and haproxy sound great, that must go on my todo list. But the biggest problem currently is how to configure DIH to split up the content to several indexers. Is there an indexing distributor? Regards, Bernd

On 03.08.2011 16:33, Shawn Heisey wrote: Replies inline. On 8/3/2011 2:24 AM, Bernd Fehling wrote: To show that I compare apples and oranges here are my previous FAST Search setup: [...] An average query time of 50 milliseconds isn't too bad. If the number from your Solr setup below (39.5) is the QTime, then Solr thinks it is performing better, but Solr's QTime does not include absolutely everything that has to happen. Do you by chance have 95th and 99th percentile query times for either system? And here is now my current Solr setup: [...] I can't tell whether you mean that each physical host has 32GB or each VM has 32GB. You want to be sure that you are not oversubscribing your memory. If you can get more memory in your machines, you really should. Do you know whether that 0.6 seconds is most of the delay that a user sees when making a search request, or are there other things going on that contribute more delay?
In our webapp, the Solr request time is usually small compared with everything else the server and the user's browser are doing to render the results page. As much as I hate being the tall pole in the tent, I look forward to the day when the developers can change that balance.
The good thing is I have the ability to compare a commercial product and enterprise system to open source. I started with my simple Solr setup because of KISS (keep it simple, stupid). Actually it is doing excellently as a single index on a single virtual server. But the average time per request should be reduced now, that's why I started this discussion. While searches with a smaller Solr index size (3 mio. docs) showed that it can stand with FAST Search, it now shows that it's time to go with sharding. I think we are already far beyond the point of the search performance crossover. What I hope to get with sharding: - reduce time for building the index - reduce average time per request
You will probably achieve both of these things by sharding, especially if you have a lot of CPU cores available. Like mine, your query volume is very low, so the CPU cores are better utilized distributing the search.
What I fear with sharding: - I currently have master/slave, do I then have e.g. 3 masters and 3 slaves? - the query changes because of sharding (is there a search distributor?) - how to distribute the content to the indexers with DIH on 3 servers? - anything else to think about while changing to sharding? [...]
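On the "indexing distributor" question: there is no built-in one for DIH. A common trick, assuming a numeric primary key in a SQL source, is to give each shard's DIH config the same query with a different modulo (table and column names here are made up for illustration):

  <entity name="doc" query="select id, title from docs where MOD(id, 3) = 0">

with MOD(id, 3) = 1 and = 2 on the other two shards. For a file-based setup like XPathEntityProcessor over XML files, the same idea applies by partitioning the file list, e.g. by a hash of the file name.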
Re: segment.gen file is not replicated
I have now updated to solr 3.3 but segment.gen is still not replicated. Any idea why, is it a bug or a feature? Should I write a jira issue for it? Regards Bernd

On 29.07.2011 14:10, Bernd Fehling wrote: Dear list, is there a deeper logic behind why the segment.gen file is not replicated with solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: segment.gen file is not replicated
On 04.08.2011 12:52, Michael McCandless wrote: This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. I.e., normally we list the directory to find the latest segments_N file; but if this is wrong (e.g. the file system might have a stale cache) then we fall back to reading the segments.gen file. For example this is sometimes needed for NFS. Likely replication is just skipping it?

That was my first idea. If not changed and touched then it will be skipped. Trying to be smart, I deleted it from the index dir on the slave and then replicated, but segment.gen was not replicated. Following your explanation, NFS could then no longer be reliable. So my guess: either a bug or a feature, and the experts will know :-) Regards Bernd

Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I have now updated to solr 3.3 but segment.gen is still not replicated. [...]
Re: performance crossover between single index and sharding
java version 1.6.0_21
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
java: file format elf64-x86-64
Including the -d64 switch.

On 04.08.2011 14:40, Bob Sandiford wrote: Dumb question time - you are using a 64 bit Java, and not a 32 bit Java? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-----Original Message----- From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] Sent: Thursday, August 04, 2011 2:39 AM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding

Hi Shawn, the 0.05 seconds search time at peak times (3 qps) is my target for Solr. [...]
string cut-off filter?
Hi list, is there a string cut-off filter to limit the length of a KeywordTokenized string? So the string should not be dropped, only limited to a certain length. Regards Bernd
Re: string cut-off filter?
Yes indeed, I currently use a workaround with a regex filter. Example for limiting to 30 characters:

<filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})" replacement="$1" replace="all"/>

Just thought there might already be such a filter. But as Karsten showed it is pretty easy to implement. Maybe Karsten can open an issue and add his code? Regards Bernd

On 08.08.2011 22:56, Markus Jelsma wrote: There is none indeed, except using copyField and maxChars. Could you perhaps come up with some regex that replaces the group of chars beyond the desired limit with ''? That would fit in a pattern replace char filter.

Hi Bernd, I also searched for such a filter but did not find it. Best regards Karsten

P.S. I am using now this filter:

public class CutMaxLengthFilter extends TokenFilter {

  public CutMaxLengthFilter(TokenStream in) {
    this(in, DEFAULT_MAXLENGTH);
  }

  public CutMaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  public static final int DEFAULT_MAXLENGTH = 15;

  private final int maxLength;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int length = termAtt.length();
    if (maxLength > 0 && length > maxLength) {
      termAtt.setLength(maxLength);
    }
    return true;
  }
}

with this factory

public class CutMaxLengthFilterFactory extends BaseTokenFilterFactory {

  private int maxLength;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    maxLength = getInt("maxLength", CutMaxLengthFilter.DEFAULT_MAXLENGTH);
  }

  public TokenStream create(TokenStream input) {
    return new CutMaxLengthFilter(input, maxLength);
  }
}

-------- Original message -------- Date: Mon, 08 Aug 2011 10:15:45 +0200 From: Bernd Fehling bernd.fehl...@uni-bielefeld.de To: solr-user@lucene.apache.org Subject: string cut-off filter?

Hi list, is there a string cut-off filter to limit the length of a KeywordTokenized string? So the string should not be dropped, only limited to a certain length. Regards Bernd
--
Bernd Fehling, Dipl.-Inform. (FH) | Universitätsbibliothek Bielefeld
Universitätsstr. 25, 33615 Bielefeld | Tel. +49 521 106-4060 | Fax. +49 521 106-4052 | bernd.fehl...@uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
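If the code above gets packaged, usage in schema.xml would presumably look like this (the package name is made up; the maxLength attribute matches what the factory's init() reads):

<filter class="my.package.CutMaxLengthFilterFactory" maxLength="30"/>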
question about query parsing
Hi list, while searching with debug on I see strange query parsing:

<str name="rawquerystring">identifier:ub.uni-bielefeld.de</str>
<str name="querystring">identifier:ub.uni-bielefeld.de</str>
<str name="parsedquery">+MultiPhraseQuery(identifier:"(ub.uni-bielefeld.de ub) uni bielefeld de")</str>
<str name="parsedquery_toString">+identifier:"(ub.uni-bielefeld.de ub) uni bielefeld de"</str>

It is a PhraseQuery, but:
- why is the string split apart?
- why is it grouped this way?
Default is edismax.

FIELD:
<field name="identifier" type="text_url" indexed="true" stored="false" multiValued="true"/>

FIELDTYPE:
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Regards Bernd
Re: Is optimize needed on slaves if it replicates from optimized master?
From what I see on my slaves: yes. After replication has finished, the new index is in place and a new reader has started, I always have a write.lock file in my index directory on the slaves, even though the index on the master is optimized. Regards Bernd

On 10.08.2011 09:12, Pranav Prakash wrote: Do slaves need a separate optimize command if they replicate from an optimized master? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Is optimize needed on slaves if it replicates from optimized master?
Sure, there is actually no optimize needed on the slave, but after calling optimize on the slave the write.lock will be removed. So why is the replication process not doing this? Regards Bernd

On 10.08.2011 10:57, Shalin Shekhar Mangar wrote: On Wed, Aug 10, 2011 at 1:11 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: From what I see on my slaves, yes. After replication has finished and the new index is in place and a new reader has started, I always have a write.lock file in my index directory on the slaves, even though the index on the master is optimized.
That is not true. Replication is roughly a copy of the diff between the master's and the slave's index. An optimized index is a merged and re-written index, so replication from an optimized master will give an optimized copy on the slave. The write lock is due to the fact that an IndexWriter is always open in Solr, even on the slaves.
Re: Solr 3.3 crashes after ~18 hours?
Hi, googling for hotspot server 19.1-b02 shows that you are not alone with hanging threads and crashes. And not only with solr. Maybe try another Java version? Bernd

On 10.08.2011 17:00, alexander sulz wrote: Okay, with this command it hangs. Also: I managed to get a thread dump (attached). regards

On 05.08.2011 15:08, Yonik Seeley wrote: On Fri, Aug 5, 2011 at 7:33 AM, alexander sulz a.s...@digiconcept.net wrote: Usually you get an XML response when doing commits or optimize; in this case I get nothing in return, but the site ( http://[...]/solr/update?optimize=true ) DOESN'T load forever or anything. It doesn't hang! I just get a blank page / empty response.
Sounds like you are doing it from a browser? Can you try it from the command line? It should give back some sort of response (or hang waiting for a response).
curl "http://localhost:8983/solr/update?commit=true"
-Yonik http://www.lucidimagination.com

I use the stuff in the example folder, the only changes I made were enabling logging and changing the port to 8985. I'll try getting a thread dump if it happens again! So far it's looking good with having allocated more memory to it.

On 04.08.2011 16:08, Yonik Seeley wrote: On Thu, Aug 4, 2011 at 8:09 AM, alexander sulz a.s...@digiconcept.net wrote: Thank you for the many replies! Like I said, I couldn't find anything in the logs created by solr. I just had a look at /var/logs/messages and there wasn't anything either. What I mean by crash is that the process is still there and http GET pings would return 200, but when I try visiting /solr/admin I'd get a blank page! The server ignores any incoming updates or commits,
ignores means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*)?
though throwing no errors, no 503's.. It's like the server has a blackout and stares blankly into space.
Are you using a different servlet container than what is shipped with solr? If you did start with the solr example server, what jetty configuration changes have you made? -Yonik http://www.lucidimagination.com

--
Bernd Fehling, Dipl.-Inform. (FH) | Universitätsbibliothek Bielefeld
Universitätsstr. 25, 33615 Bielefeld | Tel. +49 521 106-4060 | Fax. +49 521 106-4052 | bernd.fehl...@uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
sorting issue with solr 3.3
It turned out that there is a sorting issue with solr 3.3. As far as I could trace it down so far: 4 docs in the index and a search for *:*, sorting on field dccreator_sort in descending order:

http://localhost:8983/solr/select?fsv=true&sort=dccreator_sort%20desc&indent=on&version=2.2&q=*%3A*&start=0&rows=10&fl=dccreator_sort

result is:
<lst name="sort_values">
  <arr name="dccreator_sort">
    <str>convertitovistitutonazionaled</str>
    <str>莊國鴻chuangkuohung</str>
    <str>zyywwwxxx</str>
    <str>abdelhadiyasserabdelfattah</str>
  </arr>
</lst>

fieldType:
<fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\x20-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

field:
<field name="dccreator_sort" type="alphaOnlySortLim" indexed="true" stored="true" />

According to the documentation the sorting is UTF8, but _why_ is the first string at position 1 and _not_ at position 3 as it should be? Following the sorting through the code is somewhat difficult. Any hint where to look or where to start debugging? Regards Bernd
Re: sorting issue with solr 3.3
The issue was located in a 31 million docs index and I have already reduced it to a reproducible 4 document index. It is stock solr 3.3.0. Yes, the documents are also in the wrong order, not just the field sort values; I only added the field sort values to the email to keep it short. I will produce a test on Monday when I'm back in my office. Hang on... Regards Bernd http://www.base-search.net/

I've checked in an improved TestSort that adds deleted docs and randomizes things a lot more (and fixes the previous reliance on doc ids not being reordered). I still can't reproduce this error though. Is this stock solr? Can you verify that the documents are in the wrong order also (and not just the field sort values)? -Yonik http://www.lucidimagination.com
Re: sorting issue with solr 3.3
I have created an issue with a test attached: https://issues.apache.org/jira/browse/SOLR-2713 Will try to figure out what's going wrong. Regards Bernd http://www.base-search.net/

On 13.08.2011 16:20, Bernd Fehling wrote: The issue was located in a 31 million docs index and I have already reduced it to a reproducible 4 document index. [...]
commit to jira and change Status and Resolution
Hi list, I have fixed an issue and created a patch (SOLR-2726) but how to change Status and Resolution in jira? And how to commit this, any idea? Regards, Bernd
Re: Unable to generate trace
How about using jmap or jvisualvm? Or even connecting with eclipse to the process for live analysis? Am 08.09.2011 11:07, schrieb Rohit: Nope not getting anything here also. Regards, Rohit -Original Message- From: Jerry Li [mailto:zongjie...@gmail.com] Sent: 08 September 2011 08:09 To: solr-user@lucene.apache.org Subject: Re: Unable to generate trace what about kill -3 PID command? On Thu, Sep 8, 2011 at 4:06 PM, Rohitro...@in-rev.com wrote: Hi, I am running solr in tomcat on a linux machine, my solr hangs after about 40 hrs, I wanted to generate the dump and analyse the logs. But the command kill -QUIT PID doesn't seem to be doing anything. How can I generate a dump otherwise to see, why solr hangs?
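If kill -QUIT writes nothing, a thread dump can usually still be taken with jstack (part of the JDK):

  jstack <pid> > /tmp/solr-threads.txt
  jstack -F <pid>

The -F variant forces a dump when the JVM no longer responds to the normal signal, which sounds like this case.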
skipping parts of query analysis for some queries
I'm in the need of skipping some query analysis steps for some queries. Or more precisely, making it switchable with a query parameter. Use case:

<fieldType name="text_spec" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" format="solr" tokenizerFactory="solr.KeywordTokenizerFactory" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

For some queries I want to skip the SynonymFilterFactory, with or without the ShingleFilterFactory. First I thought of a second field with a separate fieldType, but why stuff content twice into the index? So I had the idea to make things switchable with a query parameter. E.g. the SynonymFilterFactory class would get two optional attributes: querycontrol="true/false" (default=false) and queryparam="sff" (default=sff). With query ...&sff=true... it will use the SynonymFilterFactory, with query ...&sff=false... it will do nothing in the SynonymFilterFactory. Easy to implement, but this is only for the SynonymFilterFactory. What if I want to switch off other filters with my query? Should I patch all FilterFactories? Next idea: how about modifying the analyzer?

<analyzer type="query">
  <charFilter .../>
  <tokenizer .../>
  <filter .../>
  <optional switch="foo">
    <filter .../>
    <filter .../>
  </optional>
</analyzer>

Now with query ...&foo=true... it will use the filters enclosed by the optional tag, with query ...&foo=false... they are skipped. Advantages:
- more flexibility
- no need to index content twice or more times if only changes in the query analysis make the difference
Any opinions? Regards, Bernd
accessing the query string from inside TokenFilter
Dear list, while writing some TokenFilter for my analyzer chain I need access to the query string from inside my TokenFilter for some comparisons, but the filters work on a TokenStream and get separate tokens. So far I couldn't get any access to the query string. Any idea how to get this done? Is there an attribute for the query or qstr? Regards Bernd
Report about Solr and multilingual Thesaurus
Dear list, just in case you are planning to integrate or combine a thesaurus with Solr the following report might help you. BASE - Solr and the multilingual EuroVoc Thesaurus http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html In brief: It explains how a working solution is possible to integrate/combine the multilingual EuroVoc Thesaurus with Solr. It is used as query time search term expansion. Covering over 22 languages this gives you the ability to also find documents in other languages than the original query and also expand the query with synonyms. Best regards Bernd
Re: cache monitoring tools?
Hi Otis, I can't find the download for the free SPM. What hardware and OS do I need for installing SPM to monitor my servers? Regards Bernd

On 07.12.2011 18:47, Otis Gospodnetic wrote: Hi Dmitry, You should use SPM for Solr - it exposes all Solr metrics and more (JVM, system info, etc.) PLUS it's currently 100% free. http://sematext.com/spm/solr-performance-monitoring/index.html We use it with our clients on a regular basis and it helps us a TON - we just helped a very popular mobile app company improve Solr performance by a few orders of magnitude (including filter tuning) with the help of SPM. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

From: Dmitry Kan dmitry@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, December 7, 2011 2:13 AM Subject: cache monitoring tools?

Hello list, We've noticed quite a huge strain on the filterCache in facet queries against trigram fields (see the schema at the end of this e-mail). The typical query contains some keywords in the q parameter and a boolean filter query on other solr fields. It is also a facet query, the facet field is of type shingle_text_trigram (see schema) and facet.limit=50. Questions: are there some tools (except for solrmeter) and/or approaches to monitor / profile the load on caches, which would help to derive better tuning parameters? Can you recommend checking config parameters of other components besides caches? BTW, this has become much faster compared to solr 1.4, where we had to do a lot of optimizations on the schema level (e.g. by making a number of stored fields non-stored). Here are the relevant stats from admin (SOLR 3.4):

description: Concurrent LRU Cache(maxSize=1, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
lookups : 93
hits : 90
hitratio : 0.96
inserts : 1
evictions : 0
size : 1
warmupTime : 0
cumulative_lookups : 93
cumulative_hits : 90
cumulative_hitratio : 0.96
cumulative_inserts : 1
cumulative_evictions : 0
item_shingleContent_trigram : {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=222924,phase1=221106,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=91}

name: filterCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=512, initialSize=512, minSize=460, acceptableSize=486, cleanupThread=false)
stats:
lookups : 1003486
hits : 2809
hitratio : 0.00
inserts : 1000694
evictions : 1000221
size : 473
warmupTime : 0
cumulative_lookups : 1003486
cumulative_hits : 2809
cumulative_hitratio : 0.00
cumulative_inserts : 1000694
cumulative_evictions : 1000221

schema excerpt:
<fieldType name="shingle_text_trigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

-- Regards, Dmitry Kan
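Independent of the monitoring tool, the quoted filterCache stats (hitratio 0.00, evictions almost equal to inserts at size 512) suggest the cache is far too small for this kind of faceting. Two things worth trying, as a sketch against the stats above rather than tested advice: enlarge the cache in solrconfig.xml,

<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="512"/>

and check which facet.method the requests end up using. facet.method=enum creates one filterCache entry per term, which cannot work well for a trigram field with ~14.8 million terms; facet.method=fc avoids the filterCache for this.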
KStemmer for Solr
Because I'm using solr from trunk and not from Lucid Imagination I was missing the KStemmer. So I decided to add this stemmer to my installation. After some modifications KStemmer is now working fine stand-alone. Now I have a KStemmerFilter. Next will be writing the KStemmerFilterFactory. I would place the Factory in lucene-solr/solr/src/java/org/apache/solr/analysis/ next to the other Factories, but where to place the Filter? Does it make sense to place the Filter somewhere under lucene-solr/modules/analysis/common/src/java/org/apache/lucene/analysis/ ? But this is for Lucene and not Solr... Or should I place the Filter in a subdirectory of the Factories? Any suggestions for me? Regards, Bernd
DIH delta-import question
Dear list, I'm trying a delta-import with datasource FileDataSource and processor FileListEntityProcessor. I want to load only files which are newer than the last_index_time from dataimport.properties. It looks like newerThan=${dataimport.last_index_time} has no effect at all. Can it be that newerThan is configured on the FileListEntityProcessor but applied to the next entity processor in line, and not to the FileListEntityProcessor itself? In my case that is the XPathEntityProcessor, which doesn't support newerThan. Version is solr 4.0 from trunk. Regards, Bernd
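For reference, the relevant part of a data-config.xml for this setup might look like the following sketch (paths, forEach and field mappings are made up; only newerThan matters here):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="/path/to/xml" fileName=".*\.xml" recursive="true"
            newerThan="${dataimport.last_index_time}" rootEntity="false">
      <entity name="doc" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/record">
        <field column="title" xpath="/record/title"/>
      </entity>
    </entity>
  </document>
</dataConfig>

newerThan is documented as an attribute of FileListEntityProcessor itself (the wiki examples use a quoted date expression like newerThan="'NOW-3DAYS'"), so if the ${dataimport.last_index_time} variable is not resolved there in trunk, that points to a bug in the variable resolution rather than in the config.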
Re: How to use polish stemmer - Stempel - in schema.xml?
Hi Jakub, I have ported the KStemmer for use in the most recent Solr trunk version. My stemmer is located in the lib directory of Solr (solr/lib/KStemmer-2.00.jar) because it belongs to Solr. Write it as a FilterFactory and use it as a filter like:

<filter class="de.ubbielefeld.solr.analysis.KStemFilterFactory" protected="protwords.txt" />

This is how my fieldType looks:

<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="de.ubbielefeld.solr.analysis.KStemFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="de.ubbielefeld.solr.analysis.KStemFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Regards, Bernd

On 28.10.2010 14:56, Jakub Godawa wrote: Hi! There is a polish stemmer http://www.getopt.org/stempel/ and I have problems connecting it with solr 1.4.1. Questions:
1. Where EXACTLY do I put the stempel-1.0.jar file?
2. How do I register the file, so I can build a fieldType like:
<fieldType name="text_pl" class="solr.TextField">
  <analyzer class="org.geoopt.solr.analysis.StempelTokenFilterFactory"/>
</fieldType>
3. Is that the right approach to make it work?
Thanks for a verbose explanation, Jakub.
Re: How to use polish stemmer - Stempel - in schema.xml?
Hi Jakub, if you unzip your stempel-1.0.jar, do you have the required directory structure and file in there? org/getopt/stempel/lucene/StempelFilter.class Regards, Bernd

On 02.11.2010 13:54, Jakub Godawa wrote: Erick, I've put the jar files like that before. I also added the directive and put the file in instanceDir/lib. What is still a problem is that even though the files are loaded:
2010-11-02 13:20:48 org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/home/jgodawa/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar' to classloader
I am not able to use the FilterFactory... maybe I am attempting it in a wrong way? Cheers, Jakub Godawa.

2010/11/2 Erick Erickson erickerick...@gmail.com: The Polish stemmer jar file needs to be findable by Solr; if you copy it to solr_home/lib and restart solr you should be set. Alternatively, you can add another lib directive to the solrconfig.xml file (there are several examples in that file already). I'm a little confused about not being able to find TokenFilter, is that still a problem? HTH Erick

On Tue, Nov 2, 2010 at 8:07 AM, Jakub Godawa jakub.god...@gmail.com wrote: Thank you Bernd! I couldn't make it run though. Here is my problem:
1. There is a file ~/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar
2. In ~/apache-solr-1.4.1/ifaq/solr/conf/solrconfig.xml there is a directive:
<lib path="../lib/stempel-1.0.jar" />
3. In ~/apache-solr-1.4.1/ifaq/solr/conf/schema.xml there is a fieldType:
(...)
<!-- Polish -->
<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.getopt.stempel.lucene.StempelFilter" />
    <!-- <filter class="org.getopt.solr.analysis.StempelTokenFilterFactory" protected="protwords.txt" /> -->
  </analyzer>
</fieldType>
(...)
4. The jar file is loaded, but I got an error:
SEVERE: Could not start SOLR. Check solr/home property java.lang.NoClassDefFoundError: org/apache/lucene/analysis/TokenFilter at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:634) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) (...)
5. A different class gave me that one:
SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.getopt.solr.analysis.StempelTokenFilterFactory' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:390) (...)
Question is: How to make <fieldType/> and <filter/> work with that Stempel? :) Cheers, Jakub Godawa.

2010/10/29 Bernd Fehling bernd.fehl...@uni-bielefeld.de: Hi Jakub, I have ported the KStemmer for use in the most recent Solr trunk version. My stemmer is located in the lib directory of Solr (solr/lib/KStemmer-2.00.jar) because it belongs to Solr.
[... rest of Bernd's earlier KStemmer mail quoted in full, trimmed ...]
Re: How to use polish stemmer - Stempel - in schema.xml?
So you call org.getopt.solr.analysis.StempelTokenFilterFactory. In this case I would assume a file StempelTokenFilterFactory.class in your directory org/getopt/solr/analysis/. And a class which extends BaseTokenFilterFactory, right?

...
public class StempelTokenFilterFactory extends BaseTokenFilterFactory implements ResourceLoaderAware {
...

On 02.11.2010 14:20, Jakub Godawa wrote: This is what stempel-1.0.jar consists of after jar -xf:
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R org/
org/: egothor getopt
org/egothor: stemmer
org/egothor/stemmer: Cell.class Diff.class Gener.class MultiTrie2.class Optimizer2.class Reduce.class Row.class TestAll.class TestLoad.class Trie$StrEnum.class Compile.class DiffIt.class Lift.class MultiTrie.class Optimizer.class Reduce$Remap.class Stock.class Test.class Trie.class
org/getopt: stempel
org/getopt/stempel: Benchmark.class lucene Stemmer.class
org/getopt/stempel/lucene: StempelAnalyzer.class StempelFilter.class
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R META-INF/
META-INF/: MANIFEST.MF
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R res
res: tables
res/tables: readme.txt stemmer_1000.out stemmer_100.out stemmer_2000.out stemmer_200.out stemmer_500.out stemmer_700.out

2010/11/2 Bernd Fehling bernd.fehl...@uni-bielefeld.de: Hi Jakub, if you unzip your stempel-1.0.jar, do you have the required directory structure and file in there? org/getopt/stempel/lucene/StempelFilter.class [... rest of the quote chain trimmed ...]
result of filtered field not indexed
Dear list, solr/lucene has a strange problem. I'm currently using apache-solr-4.0-2010-10-12_08-05-48. I have written a MessageDigest filter for fields which generally works. Part of my schema.xml:

...
<fieldType name="text_md" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="de.ubbielefeld.solr.analysis.TextMessageDigestFilterFactory" mdAlgorithm="MD5" />
  </analyzer>
</fieldType>
...
<!-- UNIQUE ID -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
...
<field name="docid" type="text_md" indexed="true" stored="true" omitNorms="true" />
...
<copyField source="id" dest="docid" />
...

I have a field type text_md which uses the KeywordTokenizerFactory and then my TextMessageDigestFilterFactory. As an example I take the MD5 of id and store it in docid. The Field Analysis page runs fine:

Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {luceneMatchVersion=LUCENE_40}
term position: 1
term text: foo
term type: word
source start,end: 0,3

de.ubbielefeld.solr.analysis.TextMessageDigestFilterFactory {mdAlgorithm=MD5, luceneMatchVersion=LUCENE_40}
term position: 1
term text: acbd18db4cc2f85cedef654fccc4a4d8
term type: word
source start,end: 0,3

The problem is that while loading via DIH the debugger shows that the TextMessageDigestFilterFactory is called and running without problems, and the result of my filter is properly returned, but somehow the result never reaches the IndexWriter and gets stored to the index. Any idea where to look? Maybe a class at a higher level doesn't recognize the change? The above source start,end still has 0,3 even after the term text has changed from foo to the MD5 string. Should it then be 0,32? Regards Bernd
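For context, a minimal sketch of what such a digest filter can look like - this is not the author's actual class, just the idea, with the algorithm hard-coded:

import java.io.IOException;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class Md5DigestFilter extends TokenFilter {

  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final MessageDigest md;

  public Md5DigestFilter(TokenStream in) {
    super(in);
    try {
      md = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);  // MD5 is always available in the JRE
    }
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // digest the current term text and replace it with the hex string
    byte[] hash = md.digest(new String(termAtt.buffer(), 0, termAtt.length()).getBytes(UTF8));
    StringBuilder hex = new StringBuilder(hash.length * 2);
    for (byte b : hash) {
      hex.append(String.format("%02x", b));
    }
    termAtt.setEmpty().append(hex);
    return true;
  }
}

The offset attribute is left alone in this sketch: offsets normally refer to positions in the original input text rather than to the produced term, which is why the analysis page still shows 0,3 - whether it should be changed to 0,32 is exactly the question raised above.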
Re: result of filtered field not indexed
Hi Rita, thanks for the advice, one problem solved. The source start,end is now set to the correct value by the filter. After further debugging it looks like this is a bug in the Lucene indexer. I wonder that no one ever noticed this... Kind regards, Bernd

On 23.11.2010 09:07, Bernd Fehling wrote: Dear list, solr/lucene has a strange problem. I'm currently using apache-solr-4.0-2010-10-12_08-05-48. I have written a MessageDigest filter for fields which generally works. [...]
Re: question about Solr SignatureUpdateProcessorFactory
Dear list, another suggestion about SignatureUpdateProcessorFactory. Why can I make signatures of several fields and place the result in one field, but _not_ make a signature of one field and place the result in several fields? Could this be realized without huge programming? Best regards, Bernd

On 29.11.2010 14:30, Bernd Fehling wrote: Dear list, a question about the Solr SignatureUpdateProcessorFactory:

for (String field : sigFields) {
  SolrInputField f = doc.getField(field);
  if (f != null) {
*   sig.add(field);
    Object o = f.getValue();
    if (o instanceof String) {
      sig.add((String)o);
    } else if (o instanceof Collection) {
      for (Object oo : (Collection)o) {
        if (oo instanceof String) {
          sig.add((String)oo);
        }
      }
    }
  }
}

Why is the field name (* above) also added to the signature and not only the content of the field? By purpose or by accident? I would like to suggest removing the field name from the signature and not mixing it up. Best regards, Bernd
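The processor is wired up in solrconfig.xml roughly like this (a sketch: the signatureField/fields values just mirror the earlier mails, the chain name is arbitrary):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">docid</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">id</str>
    <str name="signatureClass">solr.processor.MD5Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With a single entry in fields this is exactly the one-field-to-one-field case; fanning the result out to several fields is then a copyField job, as the follow-up points out.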
Re: question about Solr SignatureUpdateProcessorFactory
On 29.11.2010 14:55, Markus Jelsma wrote:
On Monday 29 November 2010 14:51:33 Bernd Fehling wrote: Dear list, another suggestion about SignatureUpdateProcessorFactory. Why can I make signatures of several fields and place the result in one field but _not_ make a signature of one field and place the result in several fields?
Use copyField

Ooooh yes, you are right.

Could it be realized without huge programming? Best regards, Bernd [... earlier mail with the code snippet trimmed, see above ...]
Re: question about Solr SignatureUpdateProcessorFactory
As mentioned, in the typical case it's important that the field names be included in the signature, but I imagine there would be cases where you wouldn't want them included (like a simple concat Signature for building basic composite keys). I think the Signature API could definitely be enhanced to have additional methods for adding field names vs adding field values. Wanna open an issue in Jira with some suggestions and use cases? -Hoss

Done. Issue SOLR-2258 with SOLR-2258.patch as a suggestion. Best regards, Bernd
Re: Creating Email Token Filter
Am 30.11.2010 10:56, schrieb Greg Smith:
Hi, I have written a plugin to filter on email types and keep those tokens. When I run it in the analysis page of the admin it all works fine, but when I use the data import handler to import the data and set the field type, it doesn't remove the other tokens and keeps the field in its original form. I have set the query and index analyzers to use the standard tokenizer factory and my custom email filter only. What could be causing this issue?

This sounds like the misunderstanding I had until the end of last week about indexing and storing in Solr/Lucene. I also had several tokenizers and filters and thought they only worked in the admin analysis page. As a matter of fact, if they work in the admin analysis page then they work :-) But you can't see it on the search result page, because the search result page always displays the original stored value, _not_ the tokenized or filtered indexed value. The tokenized/filtered content is what gets indexed, and that is not shown on the result page. Check with the Schema Browser in the admin what the indexed content of your tokenized/filtered field is.
Best regards Bernd
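As a quick cross-check beyond the Schema Browser: faceting on a field enumerates its indexed terms, so a query like the following shows what the tokenizer/filter chain actually produced rather than the stored value (a sketch; the field name email is hypothetical):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=email

Each facet value in the response is an indexed term, so if your email filter works, only the kept email tokens should appear here.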
Re: Dataimport performance
We are currently running Solr 4.x from trunk with -d64 -Xms10240M -Xmx10240M.

  Total Rows Fetched: 24935988
  Total Documents Skipped: 0
  Total Documents Processed: 24568997
  Time Taken: 5:55:19.104

24.5 million docs as XML from the filesystem in less than 6 hours. Maybe your MySQL is the bottleneck?
Regards Bernd

Am 15.12.2010 14:40, schrieb Robert Gründler:
Hi, we're looking for some comparison benchmarks for importing large tables from a MySQL database (full import). Currently, a full import of ~8 million rows from a MySQL database takes around 3 hours, on a quad-core machine with 16 GB of RAM and a RAID 10 storage setup. Solr is running on an Apache Tomcat instance, where it is the only app. The Tomcat instance has the following memory-related JAVA_OPTS: -Xms4096M -Xmx5120M. The data-config.xml looks like this (only 1 entity):

  <entity name="track"
          query="select t.id as id, t.title as title, l.title as label
                 from track t left join label l on (l.id = t.label_id)
                 where t.deleted = 0"
          transformer="TemplateTransformer">
    <field column="title" name="title_t"/>
    <field column="label" name="label_t"/>
    <field column="id" name="sf_meta_id"/>
    <field column="metaclass" template="Track" name="sf_meta_class"/>
    <field column="metaid" template="${track.id}" name="sf_meta_id"/>
    <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
    <entity name="artists"
            query="select a.name as artist from artist a
                   left join track_artist ta on (ta.artist_id = a.id)
                   where ta.track_id=${track.id}">
      <field column="artist" name="artists_t"/>
    </entity>
  </entity>

We have the feeling that 3 hours for this import is quite long, given the performance of the server running Solr/MySQL. Are we wrong with that assumption, or do people experience similar import times with this amount of data? Thanks! -robert
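If MySQL is the suspect, one thing worth checking (a sketch, assuming the stock JdbcDataSource; URL and credentials are placeholders) is whether the JDBC driver buffers the whole result set in memory. With the MySQL driver, batchSize="-1" makes DIH stream rows instead:

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="solr" password="***"
              batchSize="-1"/>

The nested artists entity also issues one extra query per track row, which for 8 million rows is often the real cost; CachedSqlEntityProcessor, or folding the join into the parent query, can avoid that.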
names of index files
Dear list,
some questions about the names of the index files. With an older Solr 4.x version from trunk my index looks like:

  _2t1.fdt _2t1.fdx _2t1.fnm _2t1.frq _2t1.nrm _2t1.prx _2t1.tii _2t1.tis segments_2 segments.gen

With a most recent version from trunk it looks like:

  _3a9.fdt _3a9.fdx _3a9.fnm _3a9_0.frq _3a9.nrm _3a9_0.prx _3a9_0.tii _3a9_0.tis segments_4 segments.gen

Why is there an _0 in some file names? Is it from Lucene or from Solr, or a fault in my system? Both indexes are optimized. Any idea?
Regards, Bernd
Re: WARNING: re-index all Lucene trunk indices
Because this was also posted to solr-user, and from some earlier experiences with Solr from trunk, I think this is also recommended for Solr users living on trunk, right? So Solr trunk builds directly against Lucene trunk? Bernd

Am 05.01.2011 11:55, schrieb Michael McCandless:
If you are using Lucene's trunk (to be 4.0) builds, read on... I just committed LUCENE-2843, which is a hard break on the index file format. If you are living on Lucene's trunk then you have to remove any previously created indices and re-index after updating. The change cuts over to a more RAM-efficient and faster terms index implementation, using FSTs (finite state transducers) to hold the term index data. Mike
DIH load only selected documents with XPathEntityProcessor
Hello list,
is it possible to load only selected documents with XPathEntityProcessor? While loading docs I want to drop/skip/ignore documents with a missing URL. Example:

  <documents>
    <document>
      <title>first title</title>
      <id>identifier_01</id>
      <link>http://www.foo.com/path/bar.html</link>
    </document>
    <document>
      <title>second title</title>
      <id>identifier_02</id>
      <link></link>
    </document>
  </documents>

The first document should be loaded; the second document should be ignored because it has an empty link (this should also work for a missing link field).
Best regards Bernd
DIH Transformer
Hi list,
currently a Transformer returns the row, but can I skip or drop a row from within the Transformer? If so, what should I return in that case, an empty row?
Regards, Bernd
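For what it's worth, the $skipRow flag that DIH understands can also be set from a custom Java transformer, so no special return value is needed. A sketch, untested; DropEmptyLinkTransformer is a made-up name:

  import java.util.Map;
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;

  public class DropEmptyLinkTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
      Object link = row.get("link");
      if (link == null || "".equals(link)) {
        // Mark the row so DIH skips it; returning null is also commonly
        // used for this, but the flag is the documented route.
        row.put("$skipRow", "true");
      }
      return row;
    }
  }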
Re: DIH load only selected documents with XPathEntityProcessor
Hi Gora, thanks a lot, very nice solution, works perfectly. I will dig more into ScriptTransformer, it seems to be very powerful. Regards, Bernd

Am 08.01.2011 14:38, schrieb Gora Mohanty:
On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling wrote: is it possible to load only selected documents with XPathEntityProcessor? While loading docs I want to drop/skip/ignore documents with missing URL. [...]
You can use a ScriptTransformer, along with $skipRow/$skipDoc. E.g., something like this for your data import configuration file:

  <dataConfig>
    <script><![CDATA[
      function skipRow(row) {
        var link = row.get('link');
        if (link == null || link == '') {
          row.put('$skipRow', 'true');
        }
        return row;
      }
    ]]></script>
    <dataSource type="FileDataSource"/>
    <document>
      <entity name="f" processor="FileListEntityProcessor"
              baseDir="/home/gora/test" fileName=".*xml"
              newerThan="'NOW-3DAYS'" recursive="true"
              rootEntity="false" dataSource="null">
        <entity name="top" processor="XPathEntityProcessor"
                forEach="/documents/document"
                url="${f.fileAbsolutePath}"
                transformer="script:skipRow">
          <field column="link" xpath="/documents/document/link"/>
          <field column="title" xpath="/documents/document/title"/>
          <field column="id" xpath="/documents/document/id"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

Regards, Gora
strange SOLR behavior with required field attribute
Dear list,
while trying different options with DIH and ScriptTransformer I also tried using the required="true" option for a field. I have 3 records:

  <documents>
    <document>
      <title>first title</title>
      <id>identifier_01</id>
      <link>http://www.foo.com/path/bar.html</link>
    </document>
    <document>
      <title>second title</title>
      <id>identifier_02</id>
      <link></link>
    </document>
    <document>
      <title>third title</title>
      <id>identifier_03</id>
    </document>
  </documents>

schema.xml snippet:

  <field name="title" type="string" indexed="true" stored="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="link" type="string" indexed="true" stored="true" required="true"/>

After loading I have 2 records in the index:

  <str name="title">first title</str>
  <str name="id">identifier_01</str>
  <str name="link">http://www.foo.com/path/bar.html</str>

  <str name="title">second title</str>
  <str name="id">identifier_02</str>
  <str name="link"/>

Sure, I get a SolrException in the logs saying missing required field: link, but that is for the third record, whereas the second record gets loaded even though link is empty. So I guess this is a feature of Solr? And the required attribute means the presence of the tag and not the presence of content for the tag, right?
Regards Bernd
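One way to make required="true" catch the empty-string case as well (a sketch along the lines of the skipRow function above, hypothetical and untested) is to remove empty fields from the row in a ScriptTransformer, so that the second record arrives with no link field at all and trips the required check:

  <script><![CDATA[
    function dropEmptyLink(row) {
      var link = row.get('link');
      if (link != null && link == '') {
        row.remove('link');  // now required="true" sees a missing field
      }
      return row;
    }
  ]]></script>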
Re: strange SOLR behavior with required field attribute
Hi Koji, I'm using apache-solr-4.0-2010-11-24_09-25-17 from trunk. A grep for SOLR-1973 in CHANGES.txt says that it should have been fixed. Strange... Regards, Bernd

Am 10.01.2011 16:14, schrieb Koji Sekiguchi:
(11/01/10 23:26), Bernd Fehling wrote: [the required="true" question quoted above] [...]
Bernd, seems like the same problem as SOLR-1973 that I've recently fixed in trunk and 3x, but I'm not sure. Which version are you using? Can you try trunk or 3x? If you still get the same error with trunk/3x, please open a jira issue. Koji
LukeRequestHandler histogram?
Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: LukeRequestHandler histogram?
Hi Stefan, thanks a lot. Regards, Bernd Am 14.01.2011 15:25, schrieb Stefan Matheis: Hi Bernd, there is an explanation from Hoss: http://search.lucidimagination.com/search/document/149e7d25415c0a36/some_kind_of_crazy_histogram#b22563120f1ec32b HTH Stefan On Fri, Jan 14, 2011 at 3:15 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: DIH with full-import and cleaning still keeps old index
Looks like this is a bug and I should write a jira issue for it? Regards Bernd

Am 20.01.2011 11:30, schrieb Bernd Fehling:
Hi list, after sending full-import=true&clean=true&commit=true, Solr 4.x (apache-solr-4.0-2010-11-24_09-25-17) responds with:

  - DataImporter doFullImport
  - DirectUpdateHandler2 deleteAll
  ...
  - DocBuilder finish
  - SolrDeletionPolicy.onCommit: commits:num=2
  - SolrDeletionPolicy updateCommits
  - SolrIndexSearcher init
  - INFO: end_commit_flush
  - SolrIndexSearcher warm
  ...
  - QuerySenderListener newSearcher
  - SolrCore registerSearcher
  - SolrIndexSearcher close
  ...

This all looks good to me, but why is the old index not deleted? Am I missing a parameter? Regards, Bernd
Re: DIH with full-import and cleaning still keeps old index
Is there a difference between sending optimize=true with the full-import command and sending optimize=true as a separate command after the full-import has finished? Regards, Bernd

Am 23.01.2011 02:18, schrieb Espen Amble Kolstad:
You're not doing an optimize; I think optimize would delete your old index. Try it out with the additional parameter optimize=true - Espen
On Thu, Jan 20, 2011 at 11:30 AM, Bernd Fehling wrote: [the full-import=true&clean=true&commit=true log quoted above] [...]
Re: DIH with full-import and cleaning still keeps old index
I sent commit=true&optimize=true as a separate command but nothing happened. Will try with the additional options waitFlush=false&waitSearcher=false&expungeDeletes=true. I wonder why the DIH admin GUI (debug.jsp) is not sending optimize=true together with full-import? Regards, Bernd

Am 24.01.2011 08:12, schrieb Espen Amble Kolstad:
I think optimize only ever gets done when either a full-import or delta-import is done. You could optimize the normal way though, see: http://wiki.apache.org/solr/UpdateXmlMessages - Espen
[earlier messages of this thread quoted above] [...]
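The "normal way" from the wiki page Espen links is to POST an optimize message to the update handler; something like this should work (a sketch; host, port and core path are placeholders for your setup):

  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
       --data-binary '<optimize waitFlush="false" waitSearcher="false"/>'

An optimize merges all segments into one, and only then can the files of the old, superseded segments actually be removed from disk, which would explain why the old index files linger after a plain clean+commit.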
solr admin result page error
Dear list,
after loading some documents via DIH which also include URLs I get this yellow XML error page as search result from the Solr admin GUI after a search. It says XML processing error: not well-formed. The code it argues about is:

  <arr name="dcurls">
    <str>http://eprints.soton.ac.uk/43350/</str>
    <str>http://dx.doi.org/doi:10.1112/S0024610706023143</str>
    <str>Martinez-Perez, Conchita and Nucinkis, Brita E.A. (2006) Cohomological dimension of
         Mackey functors for infinite groups. Journal of the London Mathematical Society, 74,
         (2), 379-396. (doi:10.1112/S0024610706023143
         &lt;http://dx.doi.org/10.1112/S002461070602314\u&gt;)</str>
  </arr>

See the \u utf8-code in the last line.
1. the loaded data is valid, well-formed and checked with xmllint. No errors.
2. there is no \u utf8-code in the source data.
3. the data is loaded via DIH without any errors.
4. when opening the source view of the result page with Firefox there is also no \u utf8-code.
The only idea I have is Solr itself or the result page generation. How to proceed, what else to check?
Regards, Bernd
Re: solr admin result page error
Results so far: I could locate and isolate the document causing the trouble. I've checked the document with xmllint again; it is valid, well-formed UTF-8. I've loaded the single document and get the XML error when displaying the search result. This happens through the Solr admin search and also the JSON interface, probably other interfaces as well. The next step is to use the debugger and see what goes wrong. One thing I can already say: the problem is caused by the UTF-8 sequence F0 9D 94 90 (U+1D510, Mathematical Fraktur Capital M). Any known issues about that? Regards, Bernd

Am 11.02.2011 08:59, schrieb Bernd Fehling: [the XML processing error question quoted above] [...]
Re: solr admin result page error
Hi Markus, yes it looks like the same issue. There is also a \u utf8-code in your dump. So far I have followed it into the XMLResponseWriter; a few steps earlier the result in a buffer looks good and the utf8-code is correct. Really hard to debug this freaky problem. Have you looked deeper into this and located the bug? It is definitely a bug and has nothing to do with Firefox. Regards, Bernd

Am 11.02.2011 13:48, schrieb Markus Jelsma:
It looks like you hit the same issue as I did a while ago: http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html
On Friday 11 February 2011 08:59:27 Bernd Fehling wrote: [the XML processing error question quoted above] [...]
Re: solr admin result page error
Hi Markus, the result of my investigation is that Lucene currently can only handle UTF-8 code within the BMP (Basic Multilingual Plane, plane 0, i.e. up to U+FFFF). Any code point above the BMP might end in unpredictable results, which is bad. If you get invalid UTF-8 from the index and use wt=xml, you get the error page; this is due to the encoding text/xml and charset=utf-8 in the header. If you use wt=json then the encoding is text/plain and charset=utf-8. Because of text/plain you don't get an error page, but the content is nevertheless invalid; I guess it replaces all invalid code with the UTF-8 BOM. So currently no solution, not even with JSON. This should (hopefully) be fixed with Lucene 3.1. Regards, Bernd

Am 11.02.2011 15:50, schrieb Markus Jelsma:
No, I haven't located the issue. It might be Solr but it could also be Xerces having trouble with it. You can possibly work around the problem by using the JSONResponseWriter.
[earlier messages of this thread quoted above] [...]
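To check whether a field value contains characters outside the BMP before feeding it to Solr, something like this can be used (a plain-Java sketch; nothing Solr-specific is assumed):

  public class BmpCheck {
    /** Returns true if s contains a code point above U+FFFF (outside the BMP). */
    static boolean hasSupplementary(String s) {
      for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (Character.isSupplementaryCodePoint(cp)) return true;
        i += Character.charCount(cp);
      }
      return false;
    }

    public static void main(String[] args) {
      // U+1D510, Mathematical Fraktur Capital M, the character from this thread
      String m = new String(Character.toChars(0x1D510));
      System.out.println(hasSupplementary(m)); // prints true
    }
  }

Filtering or replacing such code points on the DIH side would at least keep the index clean until the underlying Lucene issue is fixed.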
Content-Type of XMLResponseWriter / QueryResponseWriter
Dear list,
is there any deeper logic behind the fact that the XMLResponseWriter sends CONTENT_TYPE_XML_UTF8 = "application/xml; charset=UTF-8"? I would assume (and so do most browsers) that XML output is delivered as text/xml and not application/xml. Or do you want the browser to launch an XML editor with the result?
Best regards, Bernd
Re: Content-Type of XMLResponseWriter / QueryResponseWriter
Hi Walter, many thanks! Bernd

Am 03.03.2011 17:01, schrieb Walter Underwood:
Never use text/xml; that overrides any encoding declaration inside the XML file. http://ln.hixie.ch/?start=1037398795&count=1 http://www.grauw.nl/blog/entry/489 wunder (Lead Engineer, MarkLogic)
On Mar 3, 2011, at 7:30 AM, Bernd Fehling wrote: [the Content-Type question quoted above] [...]
from multiValued field to non-multiValued field with copyField?
Is there a way to have a kind of casting for copyField? I have author names in a multiValued string field and need sorting on it, but sort on a field only works for multiValued="false". I'm trying to get multiValued content from one field into a non-multiValued text or string field for sorting, and this, if possible, during loading with copyField. Or any other solution? I need this because of patch SOLR-2339, which is now more strict. Maybe anyone else does too. Regards, Bernd
Re: from multiValued field to non-multiValued field with copyField?
Good idea. I was also just looking into this area. Assuming my input record looks like this:

  <documents>
    <document id="foobar">
      <element name="author"><value>author_1 ; author_2 ; author_3</value></element>
    </document>
  </documents>

Do you know if I can use something like this:

  <entity name="records" processor="XPathEntityProcessor" transformer="RegexTransformer" ...>
    ...
    <field column="author" xpath="/documents/document/element[@name='author']/value"/>
    <field column="author_sort" xpath="/documents/document/element[@name='author']/value"/>
    <field column="author" splitBy=" ; "/>
    ...
  </entity>

to just double the input and make author multiValued and author_sort a string field?
Regards Bernd

Am 17.03.2011 15:39, schrieb Gora Mohanty:
On Thu, Mar 17, 2011 at 8:04 PM, Bernd Fehling wrote: [the copyField casting question quoted above] [...]
Not sure about copyField, but you could use a transformer to extract values from a multiValued field and stick them into a single-valued field. Regards, Gora
Re: from multiValued field to non-multiValued field with copyField?
Hi Yonik, actually some applications misused sorting on a multiValued field, like VuFind. And as a matter of fact FAST doesn't support this either, because it doesn't make sense. FAST distinguishes between multiValued and singleValued by just adding the separator field attribute to the field. So I moved this from the FAST index profile to the Solr DIH and placed the separator there. But now I'm looking for a solution for VuFind. The easiest thing would be to have a kind of casting, maybe for copyField. Regards, Bernd

Am 17.03.2011 15:58, schrieb Yonik Seeley:
On Thu, Mar 17, 2011 at 10:34 AM, Bernd Fehling wrote: [the copyField casting question quoted above] [...]
Hmmm, you're the second person that's relied on that (sorting on a multiValued field working). Was SOLR-2339 a mistake? -Yonik http://lucidimagination.com
Re: from multiValued field to non-multiValued field with copyField?
Hi Bill, yes DIH is in use. Thanks, Bernd

Am 17.03.2011 16:09, schrieb Bill Bell:
Do you use the DIH handler? A script can do this easily. Bill Bell, sent from mobile
On Mar 17, 2011, at 9:02 AM, Bernd Fehling wrote: [the XPathEntityProcessor sketch for author/author_sort quoted above] [...]
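The script Bill alludes to could look something like this in the data-config (a sketch, hypothetical and untested; it keeps the multiValued author field and fills author_sort with just the first name so that field stays single-valued):

  <script><![CDATA[
    function firstAuthor(row) {
      var authors = row.get('author');   // e.g. "author_1 ; author_2 ; author_3"
      if (authors != null) {
        // take everything before the first " ; " separator for sorting
        row.put('author_sort', ('' + authors).split(' ; ')[0]);
      }
      return row;
    }
  ]]></script>

with transformer="script:firstAuthor,RegexTransformer" on the entity, so the RegexTransformer split into multiple author values still happens afterwards.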