Large Hdd-Space using during commit/optimize

2010-11-29 Thread stockii

Hello.

I have ~37 million docs that I want to index.

When I start a full-import, I import only 2 million docs at a time, for better
control over Solr and disk space/heap.

So when I import 2 million docs and Solr starts the commit and the optimize,
my used disk space jumps into the sky. Reaction: after a Solr restart, the
used space goes down.

Why is Solr using so much space?

Can I optimize that?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1985807.html
Sent from the Solr - User mailing list archive at Nabble.com.


ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-29 Thread Martin Grotzke
Hi,

after an upgrade from solr-1.3 to 1.4.1 we're getting an
ArrayIndexOutOfBoundsException for a query with rows=0 and a sort
param specified:

java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.collect(TopFieldCollector.java:84)
at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1391)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:872)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

The query is e.g.:
/select/?sort=popularity+desc&rows=0&start=0&q=foo

When this is changed to rows=1 or when the sort param is removed the
exception is gone and everything's fine.

With a clean 1.4.1 installation (unzipped, started example and posted
two documents as described in the tutorial) this issue is not
reproducible.

Does anyone have a clue what might be the reason for this and how we
could fix this on the solr side?
Of course - for a quick fix - I'll change our app so that there's no
sort param specified when rows=0.

Thanx & cheers,
Martin

-- 
Martin Grotzke
http://twitter.com/martin_grotzke


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread Upayavira


On Mon, 29 Nov 2010 03:07 -0800, stockii st...@shopgate.com wrote:
 
 Hello.
 
 i have ~37 Million Docs that i want to index. 
 
 when i starte a full-import i importing only every 2 Million Docs,
 because
 of better controll over solr and space/heap 
 
 so when i import 2 million docs and solr start the commit and the
 optimize
 my used disc-space jumps into the sky. reacten: solr restart and space
 the
 used space goes down.
 
 why is using solr so many space ?  
 
 can i optimize that  ? 

What do you mean "into the sky"? What percentage increase are you
seeing?

I'd expect it to double at least. I've heard it suggested that you
should have three times the usual space available for an optimise.

Remember, when your index is optimising, you'll want to keep the
original index online and available for searches, so you'll have at
least two copies of your index on disk during an optimise.

Also, it is my understanding that if you commit infrequently, you won't
need to optimise immediately. There's nothing to stop you importing your
entire corpus, then doing a single commit. That will leave you with only
one segment (or at most two - one that existed before and was empty, and
one containing all of your documents). The net result being you don't
need to optimise at that point.
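
For example, with the DataImportHandler that can be a single request
(a sketch - parameter names as documented on the DIH wiki; host and
handler path are illustrative):

http://localhost:8983/solr/dataimport?command=full-import&commit=true&optimize=false

That imports everything, commits once at the end, and skips the optimize.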

Note - I'm no solr guru, so I could be wrong with some of the above -
I'm happy to be corrected.

Upayavira


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread Erick Erickson
First, don't optimize after every chunk; it's just making extra work for
your system. If you're using a 3.x or trunk build, optimizing doesn't do
much for you anyway, but if you must, just optimize after your entire
import is done.

Optimizing will pretty much copy the old index into a new set of files, so
you can expect your disk space to at least double, because Solr/Lucene
doesn't delete anything until it's sure that the optimize finished
successfully. Imagine the consequence of deleting files as they were copied
to save disk space: now hit a program error, power glitch or ctrl-c, and
your indexes would be corrupted.

Best
Erick

On Mon, Nov 29, 2010 at 6:07 AM, stockii st...@shopgate.com wrote:


 Hello.

 i have ~37 Million Docs that i want to index.

 when i starte a full-import i importing only every 2 Million Docs, because
 of better controll over solr and space/heap 

 so when i import 2 million docs and solr start the commit and the optimize
 my used disc-space jumps into the sky. reacten: solr restart and space the
 used space goes down.

 why is using solr so many space ?

 can i optimize that  ?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1985807.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Bernd Fehling
Dear list,
another suggestion about SignatureUpdateProcessorFactory.

Why can I make signatures of several fields and place the
result in one field, but _not_ make a signature of one field
and place the result in several fields?

Could this be realized without huge programming?

Best regards,
Bernd


Am 29.11.2010 14:30, schrieb Bernd Fehling:
 Dear list,
 
 a question about Solr SignatureUpdateProcessorFactory:
 
 for (String field : sigFields) {
   SolrInputField f = doc.getField(field);
   if (f != null) {
*    sig.add(field);
     Object o = f.getValue();
     if (o instanceof String) {
       sig.add((String)o);
     } else if (o instanceof Collection) {
       for (Object oo : (Collection)o) {
         if (oo instanceof String) {
           sig.add((String)oo);
         }
       }
     }
   }
 }
 
 Why is also the field name (* above) added to the signature
 and not only the content of the field?
 
 By purpose or by accident?
 
 I would like to suggest removing the field name from the signature and
 not mixing it up.
 
 Best regards,
 Bernd


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Erick Erickson
Why do you want to do this? It'd be the same value, just stored in
multiple fields in the document, which seems a waste. What's
the use-case you're addressing?

Best
Erick

On Mon, Nov 29, 2010 at 8:51 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 Dear list,
 another suggestion about SignatureUpdateProcessorFactory.

 Why can I make signatures of several fields and place the
 result in one field but _not_ make a signature of one field
 and place the result in several fields.

 Could be realized without huge programming?

 Best regards,
 Bernd


 Am 29.11.2010 14:30, schrieb Bernd Fehling:
  Dear list,
 
  a question about Solr SignatureUpdateProcessorFactory:
 
  for (String field : sigFields) {
    SolrInputField f = doc.getField(field);
    if (f != null) {
 *    sig.add(field);
      Object o = f.getValue();
      if (o instanceof String) {
        sig.add((String)o);
      } else if (o instanceof Collection) {
        for (Object oo : (Collection)o) {
          if (oo instanceof String) {
            sig.add((String)oo);
          }
        }
      }
    }
  }
 
  Why is also the field name (* above) added to the signature
  and not only the content of the field?
 
  By purpose or by accident?
 
  I would like to suggest removing the field name from the signature and
  not mixing it up.
 
  Best regards,
  Bernd



Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Markus Jelsma


On Monday 29 November 2010 14:51:33 Bernd Fehling wrote:
 Dear list,
 another suggestion about SignatureUpdateProcessorFactory.
 
 Why can I make signatures of several fields and place the
 result in one field but _not_ make a signature of one field
 and place the result in several fields.

Use copyField
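
For instance (a sketch - the source name is whatever your
SignatureUpdateProcessorFactory's signatureField is set to; the
destination names are illustrative):

<copyField source="signature" dest="signature_a"/>
<copyField source="signature" dest="signature_b"/>

The signature is generated in the update processor chain before the
document is built, so copyField should see the computed value.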

 
 Could be realized without huge programming?
 
 Best regards,
 Bernd
 
 Am 29.11.2010 14:30, schrieb Bernd Fehling:
  Dear list,
  
  a question about Solr SignatureUpdateProcessorFactory:
  
  for (String field : sigFields) {
    SolrInputField f = doc.getField(field);
    if (f != null) {
 *    sig.add(field);
      Object o = f.getValue();
      if (o instanceof String) {
        sig.add((String)o);
      } else if (o instanceof Collection) {
        for (Object oo : (Collection)o) {
          if (oo instanceof String) {
            sig.add((String)oo);
          }
        }
      }
    }
  }
  
  Why is also the field name (* above) added to the signature
  and not only the content of the field?
  
  By purpose or by accident?
  
  I would like to suggest removing the field name from the signature and
  not mixing it up.
  
  Best regards,
  Bernd

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Bernd Fehling

Am 29.11.2010 14:55, schrieb Markus Jelsma:
 
 
 On Monday 29 November 2010 14:51:33 Bernd Fehling wrote:
 Dear list,
 another suggestion about SignatureUpdateProcessorFactory.

 Why can I make signatures of several fields and place the
 result in one field but _not_ make a signature of one field
 and place the result in several fields.
 
 Use copyField


Ooooh yes, you are right.


 

 Could be realized without huge programming?

 Best regards,
 Bernd

 Am 29.11.2010 14:30, schrieb Bernd Fehling:
 Dear list,

 a question about Solr SignatureUpdateProcessorFactory:

  for (String field : sigFields) {
    SolrInputField f = doc.getField(field);
    if (f != null) {
 *    sig.add(field);
      Object o = f.getValue();
      if (o instanceof String) {
        sig.add((String)o);
      } else if (o instanceof Collection) {
        for (Object oo : (Collection)o) {
          if (oo instanceof String) {
            sig.add((String)oo);
          }
        }
      }
    }
  }

 Why is also the field name (* above) added to the signature
 and not only the content of the field?

 By purpose or by accident?

 I would like to suggest removing the field name from the signature and
 not mixing it up.

 Best regards,
 Bernd
 


Using Ngram and Phrase search

2010-11-29 Thread Jason, Kim

Hi, all
I want to use both EdgeNGram analysis and phrase search.
But there is a problem.

On a field which is not using EdgeNGram analysis, phrase search works well.
But when using EdgeNGram, phrase search is incorrect.

Now I'm using Solr 1.4.0.
The result of EdgeNGram analysis for "pci express" is below.
http://lucene.472066.n3.nabble.com/file/n1986848/before.jpg 

I thought the cause was term position.
So I modified the EdgeNGramTokenFilter of lucene-analyzers-2.9.1.
After modifying it, the result is below.
http://lucene.472066.n3.nabble.com/file/n1986848/after.jpg 

Now phrase search for "pci express" against the ngram index works well.
But another problem happened.

For example, when I search the phrase query "pc express", docs containing
"pci express" are matched too.
In this case I don't want to match "pci express".
I just want an exact match for "pc express".

Please give your ideas.
Thanks,
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Ngram-and-Phrase-search-tp1986848p1986848.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Hot Backup

2010-11-29 Thread Rodolico Piero
Hi all,

How can I back up Solr indexes without stopping the server?

I saw the following link:

 

http://wiki.apache.org/solr/SolrOperationsTools

http://wiki.apache.org/solr/CollectionDistribution

 

but I'm afraid that running these scripts 'on the fly' could corrupt the
indexes.

Thanks,

Piero.

 

 



search strangeness

2010-11-29 Thread ramzesua

Hi all. I have a little question. Can anyone explain why this Solr search
works so strangely? :)
For example, here is my schema.xml:
I add some fields with fieldType = "text". Here are the 'text' properties:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
            generateWordParts="1" generateNumberParts="0" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I copied all my fields into the text field:
<copyField source="name" dest="text"/>
<copyField source="caption" dest="text"/>


Then I add one document to my index. Here schema browser for field
'caption':

_term___frequency_
|annual |1 |
|golfer |1 |
|tournament |1 |
|welcom |1 |
|3rd|1 |

After that I tried to find this document by terms:
annual - no results
golfer  - found document
tournament - no results
welcom - found document
3rd - no results

I read a lot of forums, some books and http://wiki.apache.org/solr/, but
it didn't help me.
Can anyone explain why Solr searches so strangely? Or where is my problem?
Thank you ...

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1986895.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Hot Backup

2010-11-29 Thread Upayavira
As I understand it, those tools are more Solr 1.3 related, but I don't
see why they shouldn't work on 1.4.

I would say it is very unlikely that you will corrupt an index with
them.

Lucene indexes are "write once", that is, any one index file will never
be updated, only replaced. This means that taking a backup is actually
exceptionally easy, as (on Unix at least) you can create a copy of the
index directory with hard links, which takes milliseconds, even for
multi-gigabyte indexes. You just need to make sure you are not
committing while you take your backup, and it looks like those tools
will take care of that for you.

Another perk is that your backups won't take any additional disk space
(just the space for the directory data, not the files themselves). As
your index changes, disk usage will gradually increase though.

Upayavira

On Mon, 29 Nov 2010 16:13 +0100, Rodolico Piero
p.rodol...@vitrociset.it wrote:
 Hi all,
 
 How can I backup indexes Solr without stopping the server?
 
 I saw the following link:
 
  
 
 http://wiki.apache.org/solr/SolrOperationsTools
 http://wiki.apache.org/solr/SolrOperationsTools 
 
 http://wiki.apache.org/solr/CollectionDistribution
 
  
 
 but I'm afraid that running these scripts 'on the fly' indexes could be
 corrupted.
 
 Thanks,
 
 Piero.
 
  
 
  
 


BasicHelloRequestHandler plugin

2010-11-29 Thread Hong-Thai Nguyen
Hi,

Thanks for helping us.

I'm creating a 'helloworld' plugin in Solr 1.4 in BasicHelloRequestHandler.java

In solrconfig.xml, I added:

 

<requestHandler name="hello"
                class="com.polyspot.mercury.handler.BasicHelloRequestHandler">

  <!-- default values for query parameters -->

  <lst name="defaults">
    <str name="message">Default message</str>
    <int name="anumber">-10</int>
  </lst>

</requestHandler>

 

I verified the 'hello' plugin is configured correctly at:
http://localhost:8983/solr/admin/plugins

 

When I executed http://localhost:8983/solr/select?qt=hello, a
java.lang.AbstractMethodError was raised:

type: Status report (Rapport d'état)

message: null

java.lang.AbstractMethodError
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Thread.java:595)

I suppose that handleRequest in the BasicHelloRequestHandler isn't being called.

Here's the BasicHelloRequestHandler.java code:

import com.polyspot.mercury.common.params.HelloParams;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.response.SolrQueryResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.net.URL;

/**
 * User: nguyenht
 * Date: 26 nov. 2010
 */
public class BasicHelloRequestHandler implements SolrRequestHandler {

  protected static Logger log =
      LoggerFactory.getLogger(BasicHelloRequestHandler.class);

  protected NamedList initArgs = null;
  protected SolrParams defaults;

  /**
   * <code>init</code> will be called just once, immediately after creation.
   * <p>The args are user-level initialization parameters that
   * may be specified when declaring a request handler in
   * solrconfig.xml
   */
  public void init(NamedList args) {
    log.info("initializing BasicHelloRequestHandler: " + args);

    initArgs = args;

    if (args != null) {
      Object o = args.get("defaults");
      if (o != null && o instanceof NamedList) {
        defaults = SolrParams.toSolrParams((NamedList) o);
      }
    }
  }

  /**
   * Handles a query request; this method must be thread safe.
   * <p/>
   * Information about the request may be obtained from <code>req</code> and
   * response information may be set using <code>rsp</code>.
   * <p/>
   * There are no mandatory actions that handleRequest must perform.
   * An empty handleRequest implementation would fulfill
   * all interface obligations.
   */
  public void handleRequest(SolrQueryRequest solrQueryRequest,
                            SolrQueryResponse solrQueryResponse) {

    log.info("handling request for BasicHelloRequestHandler");

    // get request params
    SolrParams params = solrQueryRequest.getParams();
    String message = params.get(HelloParams.MESSAGE);

    if (message == null) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "message is mandatory");
    }

    log.info("get anumber");

    Integer anumber = params.getInt(HelloParams.ANUMBER);
    if (anumber == null) {
      anumber = defaults.getInt(HelloParams.ANUMBER);
    }

    int messageLength = message.length();

    // write response
    solrQueryResponse.add("yousaid", message);
    solrQueryResponse.add("message length", messageLength);
    solrQueryResponse.add("optionalNumber", anumber);
  }

  /*
  methods below are for JMX info
   */

  public String getName() {
    return this.getClass().getName();
  }

  public String getVersion() {
    return "1";  // TODO implement this
  }

  public String getDescription() {
    return "hello";  // TODO implement this
  }
}

 

 

Preventing index segment corruption when windows crashes

2010-11-29 Thread Peter Sturge
Hi,

With the advent of new windows versions, there are increasing
instances of system blue-screens, crashes, freezes and ad-hoc
failures.
If a Solr index is running at the time of a system halt, this can
often corrupt a segments file, requiring the index to be -fix'ed by
rewriting the offending file.
Aside from the vagaries of automating such fixes, depending on the
mergeFactor, this can be quite a few documents permanently lost.
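
(For reference, the tool that does this is Lucene's CheckIndex - a
hedged example, jar name and path illustrative:

java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index -fix

Note that -fix drops any segment it cannot read, which is exactly where
the permanently lost documents come from.)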

Would anyone have any experience/wisdom/insight on ways to mitigate
such corruption in Lucene/Solr - e.g. applying a temp file technique
etc.; though perhaps not 'just use Linux'.. :-)
There are, of course, client-side measures that can hold some number of
pending documents until they are truly committed, but a
server-side/Lucene method would be preferable, if possible.

Thanks,
Peter


Re: search strangeness

2010-11-29 Thread Erick Erickson
On a quick look with Solr 3.1, these results are puzzling. Are you
sure that you are searching the field you think you are? I take it you're
searching the "text" field, but that's controlled by your
<defaultSearchField> entry in schema.xml.

Try using the admin page, particularly the full interface link and
turn debugging on, that should give you a better idea of what
is actually being searched. Another admin page that's very useful
is the analysis page, that'll show you exactly what transformations
are made to your terms at index and query time and why.

I'm a little suspicious that you've put the stopword filter in a different
place in the index and query process, but I doubt that
is a problem. The analysis page will help with that too.

But nothing really jumps out at me. If you don't get anywhere with the
admin page, perhaps you can show us the field definitions for the
name, caption and text fields (not the type, the actual <field .../>
part of the schema).

Also, please post the results of appending &debugQuery=on to the request.

Best
Erick

On Mon, Nov 29, 2010 at 10:06 AM, ramzesua michaelnaza...@gmail.com wrote:


 Hi all. I have a little question. Can anyone explain, why this solr search
 work so strange? :)
 For example, I make schema.xml:
 I add some fields with fieldType = text. Here 'text' properties
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
             generateWordParts="1" generateNumberParts="0" catenateWords="0"
             catenateNumbers="0" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>

   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 I copied all my fields into the text field:
 <copyField source="name" dest="text"/>
 <copyField source="caption" dest="text"/>


 Then I add one document to my index. Here schema browser for field
 'caption':

 _term___frequency_
 |annual |1 |
 |golfer |1 |
 |tournament |1 |
 |welcom |1 |
 |3rd|1 |

 After that I tried to find this document by terms:
 annual - no results
 golfer  - found document
 tournament - no results
 welcom - found document
 3rd - no results

 I read a lot of forums, some books and http://wiki.apache.org/solr/
 but
 it don't help me.
 Can anyone explain me, why solr search so strange? Or where is my problem?
 Thank you ...

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1986895.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Solr DataImportHandler (DIH) and Cassandra

2010-11-29 Thread Mark

Is there any way to use DIH to import from Cassandra? Thanks


bf for Dismax completly ignored by 'recip(ms(NOW,INDAT),3.16e-11,1,1)'

2010-11-29 Thread rall0r

Hello,
I've got a problem that I'm unable to solve: as mentioned in the docs, I put a
recip(ms(NOW,INDAT),3.16e-11,1,1) in the boost-function field bf.
That is completely ignored by the dismax SearchHandler.

The dismax SearchHandler is set to be the default SearchHandler.
If I post solr/select?q={!boost
b=recip(ms(NOW,INDAT),3.16e-11,1,1)}SearchTerm to the Solr server, the
request is answered as expected, while doing it with the PHP client
completely fails.

The solrconfig looks like:

<str name="bf">recip(ms(NOW,INDAT),3.16e-11,1,1)</str>

Maybe someone has an idea?
Thanks a lot!
Ralf
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/bf-for-Dismax-completly-ignored-by-recip-ms-NOW-INDAT-3-16e-11-1-1-tp1987228p1987228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Boost on newer documents

2010-11-29 Thread Jason Brown

Hi,

I use the dismax query to search across several fields.

I find I have a lot of documents with the same document name (one of the fields
that the dismax queries), so I want to adjust the relevance so that titles
with a newer published date rank higher than documents with the
same title that are older. Does anyone know how I can achieve this?

Thank You

Jason.

If you wish to view the St. James's Place email disclaimer, please use the link 
below

http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer


Re: Boost on newer documents

2010-11-29 Thread Stefan Matheis
Hi Jason,

maybe, just use another field w/ creation-/modification-date and boost on
this field?

Regards
Stefan

On Mon, Nov 29, 2010 at 5:28 PM, Jason Brown jason.br...@sjp.co.uk wrote:


 Hi,

 I use the dismax query to search across several fields.

 I find I have a lot of documents with the same document name (one of the
 fields that the dismax queries) so I wanted to adjust the relevance so that
 titles with a newer published date have a higher relevance than documents
 with the same title but are older. Does anyone know how I can achieve this?

 Thank You

 Jason.

 If you wish to view the St. James's Place email disclaimer, please use the
 link below

 http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer



Re: Boost on newer documents

2010-11-29 Thread Mat Brown
Hi Jason,

You can use boost functions in the dismax handler to do this:

http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29
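
A sketch of what that can look like in solrconfig.xml (hedged - the
'published' date field name is illustrative; the recip() formula is the
usual date-boost recipe):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="bf">recip(ms(NOW,published),3.16e-11,1,1)</str>
  </lst>
</requestHandler>

recip() decays smoothly with document age, so two docs with the same
title are separated by recency.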

Mat

On Mon, Nov 29, 2010 at 11:28, Jason Brown jason.br...@sjp.co.uk wrote:

 Hi,

 I use the dismax query to search across several fields.

 I find I have a lot of documents with the same document name (one of the 
 fields that the dismax queries) so I wanted to adjust the relevance so that 
 titles with a newer published date have a higher relevance than documents 
 with the same title but are older. Does anyone know how I can achieve this?

 Thank You

 Jason.

 If you wish to view the St. James's Place email disclaimer, please use the 
 link below

 http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer



RE: Boost on newer documents

2010-11-29 Thread Jason Brown
Great - Thank You.


-Original Message-
From: Mat Brown [mailto:m...@patch.com]
Sent: Mon 29/11/2010 16:33
To: solr-user@lucene.apache.org
Subject: Re: Boost on newer documents
 
Hi Jason,

You can use boost functions in the dismax handler to do this:

http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29

Mat

On Mon, Nov 29, 2010 at 11:28, Jason Brown jason.br...@sjp.co.uk wrote:

 Hi,

 I use the dismax query to search across several fields.

 I find I have a lot of documents with the same document name (one of the 
 fields that the dismax queries) so I wanted to adjust the relevance so that 
 titles with a newer published date have a higher relevance than documents 
 with the same title but are older. Does anyone know how I can achieve this?

 Thank You

 Jason.

 If you wish to view the St. James's Place email disclaimer, please use the 
 link below

 http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer



If you wish to view the St. James's Place email disclaimer, please use the link 
below

http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread stockii

Aha, okay. Thx.

I didn't know that Solr copies the complete index for an optimize. Can I tell
Solr to start an optimize, but without the copy?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1987477.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Large Hdd-Space using during commit/optimize

2010-11-29 Thread Upayavira
On Mon, 29 Nov 2010 08:43 -0800, stockii st...@shopgate.com wrote:
 
 aha okay. thx
 
 i dont know that solr copys the complete index for optimize. can i solr
 say,
 that he start an optimize, but wihtout copy ? 

No.

The copy is to keep an index available for searches while the optimise
is happening.

Also, to allow for rollback should something go wrong with the optimise.

The simplest thing is to keep your commits low (I suspect you could
ingest 35m documents with just one commit at the end).

In that case, optimisation is not required (optimisation is to reduce
the number of segments in your index, and segments are created by
commits). If you don't do many commits, you won't need to optimise - at
least not at the point of initial ingestion.

Upayavira


Re: Preventing index segment corruption when windows crashes

2010-11-29 Thread Yonik Seeley
On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

Really?  That shouldn't be possible (if you mean the index is truly
corrupt - i.e. you can't open it).

-Yonik
http://www.lucidimagination.com


DIH causing shutdown hook executing?

2010-11-29 Thread Phong Dais
Hi,

I am in the process of trying to index about 50 mil documents using the data
import handler.
For some reason, about 2 days into the import, I see the message "shutdown
hook executing" in the log and the Solr web server instance exits
gracefully.
I do not see any errors in the entire log.  This has happened twice now,
usually 5 mil or so documents into the import process.

Does anyone out there knows what this message mean?  It's an INFO log
message so I don't think it is caused by any error.
Does this problem occur because the os is asking the server to shut down
(for whatever reason) or is there something wrong with the server causing it
to shutdown?

Thanks for any help,
Phong


Re: Solr Hot Backup

2010-11-29 Thread Jonathan Rochkind
In Solr 1.4, I think the replication features should be able to 
accomplish your goal, and will be easier to use and more robust.
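
A sketch of the 1.4 setup (hedged - handler name and paths are the
conventional ones; adjust to your install). On the master, in
solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

A backup can then be triggered over HTTP - the same call Piero uses in
his follow-up:

http://host:8080/solr/replication?command=backup&location=/path/to/backups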


On 11/29/2010 10:22 AM, Upayavira wrote:

As I understand it, those tools are more Solr 1.3 related, but I don't
see why they shouldn't work on 1.4.

I would say it is very unlikely that you will corrupt an index with
them.

Lucene indexes are write once, that is, any one index file will never
be updated, only replaced. This means that taking a backup is actually
exceptionally easy, as (on Unix at least) you can create a copy of the
index directory with hard links, which takes milliseconds, even for
multi-gigabyte indexes. You just need to make sure you are not
committing while you take your backup, and it looks like those tools
will take care of that for you.

Another perk is that your backups won't take any additional disk space
(just the space for the directory data, not the files themselves). As
your index changes, disk usage will gradually increase though.

Upayavira

On Mon, 29 Nov 2010 16:13 +0100, Rodolico Piero
p.rodol...@vitrociset.it  wrote:

Hi all,

How can I backup indexes Solr without stopping the server?

I saw the following link:



http://wiki.apache.org/solr/SolrOperationsTools
http://wiki.apache.org/solr/SolrOperationsTools

http://wiki.apache.org/solr/CollectionDistribution



but I'm afraid that running these scripts 'on the fly' indexes could be
corrupted.

Thanks,

Piero.







R: Solr Hot Backup

2010-11-29 Thread Rodolico Piero
Yes, I use replication only for backup, with this call:

http://host:8080/solr/replication?command=backup&location=/home/jboss/backup

It works fine, but the server must always be up... it's an HTTP call...
I also tried the 'backup' script, but it creates hard links, and those are
not recommended!


-Messaggio originale-
Da: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Inviato: lunedì 29 novembre 2010 19.22
A: solr-user@lucene.apache.org
Oggetto: Re: Solr Hot Backup

In Solr 1.4, I think the replication features should be able to 
accomplish your goal, and will be easier to use and more robust.

On 11/29/2010 10:22 AM, Upayavira wrote:
 As I understand it, those tools are more Solr 1.3 related, but I don't
 see why they shouldn't work on 1.4.

 I would say it is very unlikely that you will corrupt an index with
 them.

 Lucene indexes are write once, that is, any one index file will never
 be updated, only replaced. This means that taking a backup is actually
 exceptionally easy, as (on Unix at least) you can create a copy of the
 index directory with hard links, which takes milliseconds, even for
 multi-gigabyte indexes. You just need to make sure you are not
 committing while you take your backup, and it looks like those tools
 will take care of that for you.

 Another perk is that your backups won't take any additional disk space
 (just the space for the directory data, not the files themselves). As
 your index changes, disk usage will gradually increase though.

 Upayavira

 On Mon, 29 Nov 2010 16:13 +0100, Rodolico Piero
 p.rodol...@vitrociset.it  wrote:
 Hi all,

 How can I backup indexes Solr without stopping the server?

 I saw the following link:



 http://wiki.apache.org/solr/SolrOperationsTools
 http://wiki.apache.org/solr/SolrOperationsTools

 http://wiki.apache.org/solr/CollectionDistribution



 but I'm afraid that running these scripts 'on the fly' indexes could be
 corrupted.

 Thanks,

 Piero.







Re: Spellcheck in solr-nutch integration

2010-11-29 Thread Anurag

I solved the problem. All we need is to modify the schema file.

Also, the spellcheck index is first created when spellcheck.build=true.

-
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p1988252.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH causing shutdown hook executing?

2010-11-29 Thread Erick Erickson
You're right, the OS is asking the server to shut down.  In the default
example under Jetty, this is a result of issuing a ctrl-c. Is it possible
that something is asking your server to quit? What servlet container
are you running under? Does the Solr server run for more than this
period if you're NOT indexing? And are you sure you have enough
resources, especially disk space?

On another note, I'm surprised that it's taking 2 days to index 5m
documents.
That's less than 30 docs/second and Solr should handle a considerably
greater load than that. For whatever that's worth...

And what version of Solr are you using? You may want to consider
writing something in SolrJ to do your indexing, it'll provide you more
flexible control over indexing than DIH..
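
A minimal SolrJ 1.4 sketch (hedged - URL, field names and the loop are
illustrative):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      doc.addField("title", "document " + i);
      server.add(doc);   // documents are buffered until the commit
    }
    server.commit();     // one commit at the end, per the advice above
  }
}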

Best
Erick

On Mon, Nov 29, 2010 at 1:20 PM, Phong Dais phong.gd...@gmail.com wrote:

 Hi,

 I am in the process of trying to index about 50 mil documents using the
 data
 import handler.
 For some reason, about 2 days into the import, I see this message shutdown
 hook executing in the log and the solr web server instance exits
 gracefully.
 I do not see any errors in the entire log.  This has happened twice now,
 usually 5 mil or so documents into the import process.

 Does anyone out there knows what this message mean?  It's an INFO log
 message so I don't think it is caused by any error.
 Does this problem occur because the os is asking the server to shut down
 (for whatever reason) or is there something wrong with the server causing
 it
 to shutdown?

 Thanks for any help,
 Phong



solr admin

2010-11-29 Thread Papp Richard
Hello,

  is there any way to specify anything other than fields in the Solr admin?
And I'm not talking about the 'full interface', which is also very limited.

  like: score, fl, fq, ...

  and yes, I know that I can use the URL... which indeed is not too handy.

thanks,
  Rich
 




Re: solr admin

2010-11-29 Thread Erick Erickson
I honestly don't understand what you're asking here. Specify what
in the Solr admin other than fields? What is it you're trying to accomplish?

Best
Erick

On Mon, Nov 29, 2010 at 2:56 PM, Papp Richard ccode...@gmail.com wrote:

 Hello,

  is there any way to specify in the solr admin other than fields? and I'm
 nt talking about the full interface which is also very limited.

  like: score, fl, fq, ...

  and yes, I know that I can use the url... which indeed is not too handy.

 thanks,
  Rich







special sorting

2010-11-29 Thread Papp Richard
Hello,

  I have many pages with the same content in the search result (the result
is the same for some of the cities from the same county)... which means that
I have duplicate content.

  the filter query is something like: +locationId:(60 26a 39a) - for city
with ID 60
  and I get the same result for city with ID 62: +locationId:(62 26a 39a)
(cityID, countyID, countryID)

  how could I use sorting to get a different doc order in the results for
different cities?
  (For the same city I always need the same sort order - it cannot
be simple random...)

  could I somehow use the cityID parameter as a boost or in the score? I
tried but couldn't achieve much.

thanks,
  Rich
 

 



Re: DIH causing shutdown hook executing?

2010-11-29 Thread Phong Dais
It is entirely possible that the server is asking solr to shutdown.  I'll
have to ask the admin.
I'm running Solr-1.4 inside of Jetty.  I definitely have enough disk space.
I think I did notice solr shutting down while it was idle.  I just
disregarded it as a fluke...  Perhaps there's something going on.
I will try to run this inside of tomcat and see what happens.

Not sure if this is related, but I had to change the lockType to "single"
instead of the default "native".
With "native", I get a lock timeout when starting up Solr.  I also have
maxDocs set to 1.  I did not want to have millions of uncommitted
docs.
I'm running under Linux RedHat.

Regarding speed, the first million or so documents are done very quickly
(maybe 3 hrs), but after that, things slow down tremendously.

Thanks for the advice regarding solrj.  I'll definitely look into that.

P.


On Mon, Nov 29, 2010 at 2:39 PM, Erick Erickson erickerick...@gmail.com wrote:

 You're right, the OS is asking the server to shut down.  In the default
 example under Jetty, this is a result of issuing a crtl-c. Is it possible
 that something is asking your server to quit? What servlet container
 are you running under? Does the Solr server run for more than this
 period if you're NOT indexing? And are you sure you have enough
 resources, especially disk space?

 On another note, I'm surprised that it's taking 2 days to index 5m
 documents.
 That's less than 30 docs/second and Solr should handle a considerably
 greater load than that. For whatever that's worth...

 And what version of Solr are you using? You may want to consider
 writing something in SolrJ to do your indexing, it'll provide you more
 flexible control over indexing than DIH..

 Best
 Erick

 On Mon, Nov 29, 2010 at 1:20 PM, Phong Dais phong.gd...@gmail.com wrote:

  Hi,
 
  I am in the process of trying to index about 50 mil documents using the
  data
  import handler.
  For some reason, about 2 days into the import, I see this message
 shutdown
  hook executing in the log and the solr web server instance exits
  gracefully.
  I do not see any errors in the entire log.  This has happened twice now,
  usually 5 mil or so documents into the import process.
 
  Does anyone out there knows what this message mean?  It's an INFO log
  message so I don't think it is caused by any error.
  Does this problem occur because the os is asking the server to shut down
  (for whatever reason) or is there something wrong with the server causing
  it
  to shutdown?
 
  Thanks for any help,
  Phong
 



Re: special sorting

2010-11-29 Thread Tommaso Teofili
Perhaps, depending on your domain logic you could use function queries to
achieve that.
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
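
Another concrete option, plainly a different technique than function
queries: Solr's RandomSortField (a sketch - assumes the stock random_*
dynamic field from the example schema is still in place):

/select?q=*:*&fq=locationId:(60 26a 39a)&sort=random_60 asc

The sort seed is derived from the field name, so the same city ID always
yields the same ordering, while different cities get different orderings.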
Regards,
Tommaso

2010/11/29 Papp Richard ccode...@gmail.com

 Hello,

  I have many pages with the same content in the search result (the result
 is the same for some of the cities from the same county)... which means
 that
 I have duplicate content.

  the filter query is something like: +locationId:(60 26a 39a) - for city
 with ID 60
  and I get the same result for city with ID 62: +locationId:(62 26a 39a)
 (cityID, countyID, countryID)

  how could I use a sorting to have different docs order in results for
 different cities?
  (for the same city I need to have the same sort order always - it cannot
 be a simple random...)

  could I use somehow the cityID parameter as boost or score ? I tried but
 could't realise too much.

 thanks,
  Rich







Bad file descriptor Errors

2010-11-29 Thread John Williams
Recently, we have started to get "Bad file descriptor" errors in one of our
Solr instances. This instance is a searcher and its index is stored on a
local SSD. The master, however, has its index stored on NFS, which currently
seems to be working fine.

I have tried restarting tomcat and bringing over the index fresh from the 
master (via snappull/snapinstall). 

Any help would be greatly appreciated.

Thanks,
John


SEVERE: Exception during commit/optimize: java.lang.RuntimeException:
java.io.FileNotFoundException: /u/solr/data/index/_w3vs.fnm (Bad file descriptor)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:371)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:512)
at org.apache.solr.core.SolrCore.update(SolrCore.java:771)
at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:53)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)



Re: DIH causing shutdown hook executing?

2010-11-29 Thread Erick Erickson
Try without autocommit or bump the limit up considerably to see
if it changes the behavior. You should not be getting
this kind of performance hit after the first million  docs, so, it's
probably worth exploring.
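
For reference, the knob in question lives in solrconfig.xml (values
illustrative):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs> <!-- bump this up, or remove autoCommit entirely -->
  </autoCommit>
</updateHandler>

Removing the <autoCommit> element disables autocommit entirely.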

See if you can find anything in your logs that indicates what's
hogging the critical resource maybe?

Best
Erick

On Mon, Nov 29, 2010 at 3:08 PM, Phong Dais phong.gd...@gmail.com wrote:

 It is entirely possible that the server is asking solr to shutdown.  I'll
 have to ask the admin.
 I'm running Solr-1.4 inside of Jetty.  I definitely have enough disk space.
 I think I did notice solr shutting down while it was idle.  I just
 disregarded it as a fluke...  Perhaps there's something going on.
 I will try to run this inside of tomcat and see what happens.

 Not sure if this is related but I had to change the lockType to single
 instead of the default native.
 With native, I get a lock time out when starting up solr.  I also have
 maxDocs set to 1.  I did not want to have millions of uncommitted
 docs.
 I'm running under Linux RedHat.

 Regarding speed, the first million or so documents is done very quickly
 (maybe 3 hrs) but after that, things slows down tremendously.

 Thanks for the advice regarding solrj.  I'll definitely look into that.

 P.


 On Mon, Nov 29, 2010 at 2:39 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  You're right, the OS is asking the server to shut down.  In the default
  example under Jetty, this is a result of issuing a crtl-c. Is it possible
  that something is asking your server to quit? What servlet container
  are you running under? Does the Solr server run for more than this
  period if you're NOT indexing? And are you sure you have enough
  resources, especially disk space?
 
  On another note, I'm surprised that it's taking 2 days to index 5m
  documents.
  That's less than 30 docs/second and Solr should handle a considerably
  greater load than that. For whatever that's worth...
 
  And what version of Solr are you using? You may want to consider
  writing something in SolrJ to do your indexing, it'll provide you more
  flexible control over indexing than DIH..
 
  Best
  Erick
 
  On Mon, Nov 29, 2010 at 1:20 PM, Phong Dais phong.gd...@gmail.com
 wrote:
 
   Hi,
  
   I am in the process of trying to index about 50 mil documents using the
   data
   import handler.
   For some reason, about 2 days into the import, I see this message
  shutdown
   hook executing in the log and the solr web server instance exits
   gracefully.
   I do not see any errors in the entire log.  This has happened twice
 now,
   usually 5 mil or so documents into the import process.
  
   Does anyone out there knows what this message mean?  It's an INFO log
   message so I don't think it is caused by any error.
   Does this problem occur because the os is asking the server to shut
 down
   (for whatever reason) or is there something wrong with the server
 causing
   it
   to shutdown?
  
   Thanks for any help,
   Phong
  
 



Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single body field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.

-- 
Jacob Elder
@jelder
(646) 535-3379


Termvector based result grouping / field collapsing?

2010-11-29 Thread Shawn Heisey
I was just in a meeting where we discussed customer feedback on our 
website.  One thing that the users would like to see is "galleries", 
where photos that are part of a set are grouped together under a single 
result.  This is basically field collapsing.


The problem I've got is that for most of our content, there's nothing to 
tie different photos together in a coherent way other than similar 
language in fields like the caption.  Is it feasible to use termvector 
information to automatically group documents with similar (but not 
identical) data in one or more fields?


Thanks,
Shawn



Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Markus Jelsma
You can use only one tokenizer per analyzer. You'd better use separate fields + 
fieldTypes for different languages.

 I am looking for a clear example of using more than one tokenizer for a
 source single field. My application has a single body field which until
 recently was all latin characters, but we're now encountering both English
 and Japanese words in a single message. Obviously, we need to be using CJK
 in addition to WhitespaceTokenizerFactory.
 
 I've found some references to using copyFields or NGrams but I can't quite
 grasp what the whole solution would look like.


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 You can use only one tokenizer per analyzer. You'd better use separate
 fields +
 fieldTypes for different languages.

  I am looking for a clear example of using more than one tokenizer for a
  source single field. My application has a single body field which until
  recently was all latin characters, but we're now encountering both
 English
  and Japanese words in a single message. Obviously, we need to be using
 CJK
  in addition to WhitespaceTokenizerFactory.
 
  I've found some references to using copyFields or NGrams but I can't
 quite
  grasp what the whole solution would look like.




-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder jel...@locamoda.com wrote:
 The problem is that the field is not guaranteed to contain just a single
 language. I'm looking for some way to pass it first through CJK, then
 Whitespace.

 If I'm totally off-target here, is there a recommended way of dealing with
 mixed-language fields?


maybe you should consider a tokenizer like StandardTokenizer, that
works reasonably well for most languages.


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
Korean. Am I wrong about that?

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir rcm...@gmail.com wrote:

 On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder jel...@locamoda.com wrote:
  The problem is that the field is not guaranteed to contain just a single
  language. I'm looking for some way to pass it first through CJK, then
  Whitespace.
 
  If I'm totally off-target here, is there a recommended way of dealing
 with
  mixed-language fields?
 

 maybe you should consider a tokenizer like StandardTokenizer, that
 works reasonably well for most languages.




-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder jel...@locamoda.com wrote:
 StandardTokenizer doesn't handle some of the tokens we need, like
 @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
 Korean. Am I wrong about that?

it uses the unigram method for CJK ideographs... the CJKTokenizer just
uses the bigram method; it's just an alternative method.

the whitespace tokenizer doesn't work at all though, so give up on that!
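
To illustrate the difference (hedged - token output as these analyzers
behave in Lucene 2.9), take the string 日本語:

  StandardTokenizer (unigram): 日 | 本 | 語
  CJKTokenizer (bigram): 日本 | 本語

The bigrams overlap adjacent characters, which tends to give better
phrase precision at the cost of a larger index.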


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jonathan Rochkind
You can only use one tokenizer on a given field, I think. But a tokenizer
isn't in fact the only thing that can tokenize; an ordinary filter can
change tokenization too, so you could use two filters in a row.

You could also write your own custom tokenizer that does what you want,
although I'm not entirely sure that turning exactly what you say into
code would actually do what you want. I think it's more complicated: you'll
need a tokenizer that looks for contiguous blocks of bytes
that are UTF-8 CJK and does one thing to them, and contiguous blocks of
bytes that are not UTF-8 CJK and does another thing to them, rather than
just applying one to the whole string and then the other.


Dealing with mixed language fields is tricky, I know of no general 
purpose good solutions, in part just because of the semantics involved.


If you have some strings for the field you know are CJK, adn others you 
know are English, the easiest thing to do is NOT put them in the same 
field, but put them in different fields, and use dismax (for example) to 
search both fields on query.  But if you can't even tell at index time 
which is which, or if you have strings that themselves include both CJK 
and English interspersed with each other, that might not work.
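
A sketch of that separate-fields setup (field and type names are
illustrative, not from this thread):

<field name="body" type="string" indexed="false" stored="true"/>
<field name="body_en" type="text_en" indexed="true" stored="false"/>
<field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>
<copyField source="body" dest="body_en"/>
<copyField source="body" dest="body_cjk"/>

and in the dismax handler defaults:

<str name="qf">body_en body_cjk</str>

Each copy analyzes the same text with a different analyzer, and dismax
picks up whichever field matches at query time.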


For my own case, where everything is just interspersed in the fields and 
I don't really know what language it is, here's what I do, which is 
definitely not great for CJK, but is better than nothing:


* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars, and 
re-tokenizes any CJK chars into one-token-per-char. This custom filter 
was written by someone other than me; it is open source; but I'm not 
sure if it's actually in a public repo, or how well documented it is.  I 
can put you in touch with the author to try and ask. There may also be a 
more standard filter other than the custom one I'm using that does the 
same thing?


Jonathan

Jonathan

On 11/29/2010 5:30 PM, Jacob Elder wrote:

The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
markus.jel...@openindex.io wrote:


You can use only one tokenizer per analyzer. You'd better use separate
fields +
fieldTypes for different languages.


I am looking for a clear example of using more than one tokenizer for a
source single field. My application has a single body field which until
recently was all latin characters, but we're now encountering both

English

and Japanese words in a single message. Obviously, we need to be using

CJK

in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't

quite

grasp what the whole solution would look like.





Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 * As a tokenizer, I use the WhitespaceTokenizer.

 * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
 any CJK chars into one-token-per-char. This custom filter was written by
 someone other than me; it is open source; but I'm not sure if it's actually
 in a public repo, or how well documented it is.  I can put you in touch with
 the author to try and ask. There may also be a more standard filter other
 than the custom one I'm using that does the same thing?


You are describing what StandardTokenizer does.


RE: solr admin

2010-11-29 Thread Papp Richard
in Solr admin (http://localhost:8180/services/admin/)
I can specify something like:

+category_id:200 +xxx:300

but how can I specify a sort option?

sort:category_id+asc

regards,
  Rich

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, November 29, 2010 22:00
To: solr-user@lucene.apache.org
Subject: Re: solr admin

I honestly don't understand what you're asking here. Specify what
in the Solr admin, other than fields? What is it you're trying to accomplish?

Best
Erick

On Mon, Nov 29, 2010 at 2:56 PM, Papp Richard ccode...@gmail.com wrote:

 Hello,

  is there any way to specify anything in the Solr admin other than fields? And I'm
 not talking about the full interface, which is also very limited.

  like: score, fl, fq, ...

  and yes, I know that I can use the URL... which indeed is not too handy.

 thanks,
  Rich




RE: special sorting

2010-11-29 Thread Papp Richard
Hmm, any clue how to use it? Should I use the location_id somehow?

thanks,
  Rich

-Original Message-
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com] 
Sent: Monday, November 29, 2010 22:08
To: solr-user@lucene.apache.org
Subject: Re: special sorting

Perhaps, depending on your domain logic, you could use function queries to
achieve that.
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
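
For example, something along these lines (a sketch - it assumes a Solr
version with sort-by-function support, and a numeric per-document field,
here called doc_num, which is hypothetical):

  /select?q=*:*&fq=locationId:(60 26a 39a)&sort=mod(sum(doc_num,60),1000) asc

The idea is to feed the current city's ID (60 here) in as a constant, so
each city gets a stable but different ordering.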
Regards,
Tommaso

2010/11/29 Papp Richard ccode...@gmail.com

 Hello,

 I have many pages with the same content in the search result (the result
 is the same for some of the cities from the same county)... which means
 that I have duplicate content.

  the filter query is something like: +locationId:(60 26a 39a) - for city
 with ID 60
  and I get the same result for city with ID 62: +locationId:(62 26a 39a)
 (cityID, countyID, countryID)

  how could I use sorting to get a different document order in the results
 for different cities?
  (for the same city I always need the same sort order - it cannot be
 simply random...)

  could I somehow use the cityID parameter as a boost or score? I tried but
 couldn't achieve much.

 thanks,
  Rich




Re: Solr DataImportHandler (DIH) and Cassandra

2010-11-29 Thread Mark
The DataSource subclass route is what I will probably be interested in. 
Are there any working examples of this already out there?


On 11/29/10 12:32 PM, Aaron Morton wrote:

AFAIK there is nothing pre-written to pull the data out for you.

You should be able to create your own DataSource subclass 
(http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/DataSource.html) 
using the Hector Java library to pull data from Cassandra.
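
Something like this skeleton, perhaps (a rough, untested sketch - the 
DataSource methods are the real DIH API, but the Hector calls are only 
indicated in comments because they depend on the Hector version you use):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

public class CassandraDataSource extends DataSource<Iterator<Map<String, Object>>> {

    @Override
    public void init(Context context, Properties initProps) {
        // Read host/keyspace settings from initProps and open a Hector
        // connection here, e.g. HFactory.getOrCreateCluster(...).
    }

    @Override
    public Iterator<Map<String, Object>> getData(String query) {
        // Interpret 'query' (say, a column family name), run a range/slice
        // query through Hector, and map each row to a Map whose keys match
        // the field names declared on the DIH entity.
        List<Map<String, Object>> rows = new ArrayList<Map<String, Object>>();
        // ... populate rows from Cassandra here ...
        return rows.iterator();
    }

    @Override
    public void close() {
        // Release the Cassandra connection(s).
    }
}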


I'm guessing you will need to consider how to perform delta imports. 
Perhaps using the secondary indexes in 0.7* , or maintaining your own 
queues or indexes to know what has changed.


There is also the Lucandra project - not exactly what you're after, but it 
may be of interest anyway: https://github.com/tjake/Lucandra


Hope that helps.
Aaron


On 30 Nov, 2010, at 05:04 AM, Mark static.void@gmail.com wrote:


Is there anyway to use DIH to import from Cassandra? Thanks


RE: solr admin

2010-11-29 Thread Ahmet Arslan
 in Solr admin (http://localhost:8180/services/admin/)
 I can specify something like:
 
 +category_id:200 +xxx:300
 
 but how can I specify a sort option?
 
 sort:category_id+asc

There is a [FULL INTERFACE] link (/admin/form.jsp), but it does not have a sort 
option. It seems that you need to append it to your search URL.
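
For example (a sketch - adjust host and handler path to your setup):

  http://localhost:8180/services/select?q=%2Bcategory_id%3A200+%2Bxxx%3A300&sort=category_id+asc

i.e. sort is a separate request parameter of the form "field direction",
not part of q.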


  


Re: solr admin

2010-11-29 Thread Yonik Seeley
On Mon, Nov 29, 2010 at 8:02 PM, Ahmet Arslan iori...@yahoo.com wrote:
 in Solr admin (http://localhost:8180/services/admin/)
 I can specify something like:

 +category_id:200 +xxx:300

 but how can I specify a sort option?

 sort:category_id+asc

 There is a [FULL INTERFACE] link (/admin/form.jsp), but it does not have a sort
 option. It seems that you need to append it to your search URL.

Heh - yeah... that's an old interface, from the times when sort was
specified along with the query.
Can someone provide a patch to add a way to specify the sort?

-Yonik
http://www.lucidimagination.com


Re: Spell checking question from a Solr novice

2010-11-29 Thread Bill Dueber
On Mon, Oct 18, 2010 at 5:24 PM, Jason Blackerby jblacke...@gmail.com wrote:

 If you know the misspellings you could prevent them from being added to the
 dictionary with a StopFilterFactory like so:
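 [Presumably something along these lines - the stopwords file name is
 hypothetical:]

   <filter class="solr.StopFilterFactory" words="misspellings.txt" ignoreCase="true"/>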



Or, you know, correct the data :-)

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: question about Solr SignatureUpdateProcessorFactory

2010-11-29 Thread Chris Hostetter

: Why is also the field name (* above) added to the signature
: and not only the content of the field?
: 
: By purpose or by accident?

It was definitely deliberate.  This way if your signature fields are 
fieldA,fieldB,fieldC then these two documents...

Doc1:fieldA:XXX
Doc1:fieldB:YYY

Doc2:fieldB:XXX
Doc2:fieldC:YYY

...don't wind up with identical signature values.

: I would like to suggest removing the field name from the signature and
: not mixing it up.

As mentioned, in the typical case it's important that the field names be 
included in the signature, but I imagine there would be cases where you 
wouldn't want them included (like a simple concat Signature for building 
basic composite keys).

I think the Signature API could definitely be enhanced to have additional 
methods for adding field names vs adding field values.

Wanna open an issue in Jira with some suggestions and use cases?


-Hoss


Re: search strangeness

2010-11-29 Thread ramzesua

Hi, Erick. There is a defaultSearchField in my schema.xml. Can you give me your
example configuration for the text field? (What filters do you use for indexing
and for querying?)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1989466.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Shawn Heisey

On 11/29/2010 3:15 PM, Jacob Elder wrote:

I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single body field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.


What I'd like to see is a CJK filter that runs after tokenization 
(whitespace in my case) and doesn't do anything but handle the CJK 
characters.  If there are no CJK characters in the token, it should do 
nothing at all.  The CJK tokenizer does a whole host of other things 
that I want to handle myself.
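
Something like this is what I have in mind - a rough, untested sketch 
against the Lucene 2.9-era TokenFilter API that Solr 1.4 ships with 
(offset/position bookkeeping and reset() handling are deliberately 
ignored here):

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class CJKSplitFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final LinkedList<String> pending = new LinkedList<String>();

    public CJKSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit tokens left over from a previous split.
            termAtt.setTermBuffer(pending.removeFirst());
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (!containsCJK(term)) {
            return true; // no CJK chars: pass the token through untouched
        }
        // Re-tokenize: each CJK char becomes its own token; contiguous
        // non-CJK chars stay together.
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (isCJK(c)) {
                if (run.length() > 0) {
                    pending.add(run.toString());
                    run.setLength(0);
                }
                pending.add(String.valueOf(c));
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            pending.add(run.toString());
        }
        termAtt.setTermBuffer(pending.removeFirst());
        return true;
    }

    private static boolean containsCJK(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isCJK(s.charAt(i))) return true;
        }
        return false;
    }

    private static boolean isCJK(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }
}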


Shawn