Re: Is this a bug of the ResourceLoader?
On Mon, Apr 5, 2010 at 2:28 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> Robert: BOMs are one of those things that strike me as being abhorrent and inherently evil because they seem to cause nothing but problems
Yes.
> If text files that start with a BOM aren't properly being dealt with by Solr right now, should we consider that a bug?
No.
> Is there something we can/should be doing in SolrResourceLoader to make Solr handle this situation better?
Yes, we can ignore them for the first line of the file to be more user-friendly. I'll open an issue.
-- Robert Muir rcm...@gmail.com
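In the spirit of Robert's suggestion, a minimal Java sketch of skipping a UTF-8 BOM at the start of a stream (an illustration only, not the eventual Solr patch):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    public class BomUtil {
        // Returns a stream positioned past a leading UTF-8 BOM (EF BB BF), if any.
        public static InputStream skipUtf8Bom(InputStream in) throws IOException {
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = pin.read(head, 0, 3);
            boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
            if (!isBom && n > 0) {
                pin.unread(head, 0, n); // not a BOM: push the bytes back
            }
            return pin;
        }
    }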
Re: Read Time Out Exception while trying to upload a huge SOLR input xml
Solr also has a feature to stream from a local file rather than over the network. The parameter stream.file=/full/local/file/name.txt means 'read this file from the local disk instead of the POST upload'. Of course, you have to get the entire file onto the Solr indexer machine (or a common file server). http://wiki.apache.org/solr/UpdateRichDocuments#Parameters

On Thu, Apr 1, 2010 at 9:27 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi Erick, Shawn, Thank you for your reply. Luckily, just on the second attempt my 13GB SOLR XML (more than a million docs) went into SOLR fine without any problem, and I uploaded another two sets of 1.2 million+ docs without any hassle. I will try smaller XMLs next time, as well as the autocommit suggestion. Best Rgds, Mark.

On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith sh...@thena.net wrote:
The error might be that your http client doesn't handle really large files (32-bit overflow in the Content-Length header?) or something in your network is killing your long-lived socket. Solr can definitely accept a 13GB xml document. I've uploaded large files into Solr successfully, including recently a 12GB XML input file with ~4 million documents. My Solr instance had 2GB of memory and it took about 2 hours. Solr streamed the XML in nicely. I had to jump through a couple of hoops, but in my case it was easier than writing a tool to split up my 12GB XML file...

1. I tried to use curl to do the upload, but it didn't handle files that large. For my quick and dirty testing, netcat (nc) did the trick--it doesn't buffer the file in memory and it doesn't overflow the Content-Length header. Plus I could pipe the data through pv to get a progress bar and estimated time of completion. Not recommended for production!

    FILE=documents.xml
    SIZE=$(stat --format %s $FILE)
    (echo "POST /solr/update HTTP/1.1
    Host: localhost:8983
    Content-Type: text/xml
    Content-Length: $SIZE
    "; cat $FILE) | pv -s $SIZE | nc localhost 8983

2. Indexing seemed to use less memory if I configured Solr to auto commit periodically in solrconfig.xml. This is what I used:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>25000</maxDocs>   <!-- maximum uncommitted docs before autocommit triggered -->
        <maxTime>300000</maxTime>  <!-- 5 minutes, maximum time (in ms) after adding a doc before an autocommit is triggered -->
      </autoCommit>
    </updateHandler>

Shawn

On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson erickerick...@gmail.com wrote:
Don't do that. For many reasons <G>. By trying to batch so many docs together, you're just *asking* for trouble. Quite apart from whether it'll work once, having *any* HTTP-based protocol work reliably with 13G is fragile... For instance, I don't want to have to know whether the XML parsing in SOLR parses the entire document into memory before processing or not. But I sure don't want my application to change behavior if SOLR changes its mind and wants to process the other way. My perfectly working application (assuming an event-driven parser) could suddenly start requiring over 13G of memory... Oh my aching head! Your specific error might even be dependent upon GCing, which will cause it to break differently, sometimes, maybe... So do break things up and transmit multiple documents. It'll save you a world of hurt. HTH Erick

On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, For the first time I tried uploading a huge input SOLR xml having about 1.2 million *docs* (13GB in size).
After some time I get the following exception:

The server encountered an internal error ([was class java.net.SocketTimeoutException] Read timed out
java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at
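For comparison, a hedged sketch of the stream.file approach Lance describes above (host, port, and file path are placeholders; remote streaming must be enabled via enableRemoteStreaming in solrconfig.xml):

    import java.io.InputStream;
    import java.net.URL;

    public class StreamFileUpdate {
        public static void main(String[] args) throws Exception {
            // Solr opens the file from its own disk; nothing is POSTed over the wire.
            URL url = new URL("http://localhost:8983/solr/update"
                + "?stream.file=/full/local/file/name.xml"
                + "&stream.contentType=text/xml;charset=utf-8"
                + "&commit=true");
            InputStream response = url.openStream();
            response.close();
        }
    }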
Re: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
Hi,

Can no-one help me with this?

Andrew

On 2 April 2010 22:24, Andrew McCombe eupe...@gmail.com wrote:
Hi, I am experimenting with Solr to index my gmail and am experiencing an error: 'Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor'. I downloaded a fresh 1.4 tgz, extracted it and added the following to example/solr/config/solrconfig.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
      </lst>
    </requestHandler>

email-data-config.xml contained the following:

    <dataConfig>
      <document name="mailindex">
        <entity processor="MailEntityProcessor" user="eupe...@gmail.com" password="xx" host="imap.gmail.com" protocol="imaps" folders="inbox"/>
      </document>
    </dataConfig>

Whenever I try to import data using /dataimport?command=full-import I am seeing the error below:

    Apr 2, 2010 10:14:51 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
    SEVERE: Full Import failed
    org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:11418758786959 Processing Document # 1
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:805)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:536)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
    Caused by: java.lang.ClassNotFoundException: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
        at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:966)
        at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:802)
        ... 6 more
    Caused by: org.apache.solr.common.SolrException: Error loading class 'MailEntityProcessor'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
        at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:956)
        ... 7 more
    Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
        ... 8 more
    Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
    INFO: start rollback
    Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
    INFO: end_rollback

Am I missing a step somewhere? I have tried this with the standard apache 1.4, a nightly of 1.5 and also the LucidWorks release and get the same issue with each. The wiki isn't very detailed either.
My background isn't in Java, so a lot of this is new to me. Regards, Andrew McCombe
Re: Experience with indexing billions of documents?
The 2B limitation is within one shard, due to using a signed 32-bit integer. There is no limit in that regard in sharding: Distributed Search uses the stored unique document id rather than the internal docid.

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote:
A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard. If there's a 2B doc id limitation in Lucene then I assume he's patched it himself.

On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:
My guess is that you will need to take advantage of Solr 1.5's upcoming cloud/cluster renovations and use multiple indexes to comfortably achieve those numbers. Hypothetically, in that case, you won't be limited by the single-index docid limitations of Lucene.

We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However, we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages, as there are some use cases where users want the most relevant pages regardless of what book they occur in. However, we estimate that we are talking about somewhere between 1 and 6 billion pages and have concerns over whether Solr will scale to this level. Does anyone have experience using Solr with 1-6 billion Solr documents? The Lucene file format document (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the Lucene internal document id and would therefore be a per-index/per-shard limit. Is this correct? Tom Burton-West.

-- Lance Norskog goks...@gmail.com
Re: Index db data
It seems to work ;). However, trueman, you should subscribe to solr-user@lucene.apache.org, since not everybody looks up Nabble for mailing-list postings. - Mitch
Re: Solr caches and nearly static indexes
In a word: no.

What you can do instead of deleting them is to add them to a growing list of "don't search for these" documents. This could be listed in a filter query. We had exactly this problem in a consumer app: we had a small but continuously growing list of obscene documents in the index, and did not want to display these. So we had a filter query with all of the obscene words, and used this with every query.

Lance

On Fri, Apr 2, 2010 at 6:34 PM, Shawn Heisey s...@elyograg.org wrote:
My index has a number of shards that are nearly static, each with about 7 million documents. By nearly static, I mean that the only changes that normally happen to them are document deletions, done with the xml update handler. The process that does these deletions runs once every two minutes, and does them with a query on a field other than the one that's used for uniqueKey. Once a day, I will be adding data to these indexes with the DIH delta-import. One of my shards gets all new data once every two minutes, but it is less than 5% the size of the others.

The problem that I'm running into is that every time a delete is committed, my caches are suddenly invalid and I seem to have two options: spend a lot of time and I/O rewarming them, or suffer with slow (3 seconds or longer) search times. Is there any way to have the index keep its caches when the only thing that happens is deletions, then invalidate them when it's time to actually add data? It would have to be something I can dynamically change when switching between deletions and the daily import.

Thanks,
Shawn

-- Lance Norskog goks...@gmail.com
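Along the lines of Lance's suggestion, a minimal SolrJ sketch (server URL, field name, and ids are placeholders): the "deleted" documents stay in the index, so no commit happens and the caches stay warm.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SoftDeleteFilter {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("some user query");
            // Hide "deleted" docs at query time instead of deleting them:
            q.addFilterQuery("-id:(123 OR 456 OR 789)");
            System.out.println(server.query(q).getResults().getNumFound());
        }
    }

The caveat, discussed later in this thread, is that the filter string changes every time the list grows, so this trades index-cache stability for filterCache/queryResultCache churn.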
Some help for folks trying to get new Solr/Lucene up in Eclipse
Hey All, Just to save some folks some time in case you are trying to get new Lucene/Solr up and running in Eclipse. If you continue to get weird errors, e.g., in solr/src/test/TestConfig.java regarding org.w3c.dom.Node#getTextContent(), I found that, for me, this error was caused by including Tidy.jar (which includes its own version of the Node API) in the build path. If you take that out, you should be good. Wanted to pass that along. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Obtaining SOLR index size on disk
This information is not available via the API. If you would like this information added to the statistics request, please file a JIRA requesting it. Without knowing the size of the index files to be transferred, the client cannot monitor its own disk space. This would be useful for the cloud management features.

On Mon, Apr 5, 2010 at 5:35 AM, Na_D nabam...@zaloni.com wrote:
hi, I am using the piece of code given below:

    ReplicationHandler handler2 = new ReplicationHandler();
    System.out.println(handler2.getDescription());
    NamedList statistics = handler2.getStatistics();
    System.out.println("Statistics " + statistics);

The result that I am getting (i.e., the printed statement) is:

    Statistics {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}

But the statistics consist of the other info too:

    <entry>
      <class>org.apache.solr.handler.ReplicationHandler</class>
      <version>$Revision: 829682 $</version>
      <description>ReplicationHandler provides replication of index and configuration files from Master to Slaves</description>
      <stats>
        <stat name="handlerStart">1270463612968</stat>
        <stat name="requests">0</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">0</stat>
        <stat name="avgTimePerRequest">NaN</stat>
        <stat name="avgRequestsPerSecond">0.0</stat>
        <stat name="indexSize">19.29 KB</stat>
        <stat name="indexVersion">1266984293131</stat>
        <stat name="generation">3</stat>
        <stat name="indexPath">C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index</stat>
        <stat name="isMaster">true</stat>
        <stat name="isSlave">false</stat>
        <stat name="confFilesToReplicate">schema.xml,stopwords.txt,elevate.xml</stat>
        <stat name="replicateAfter">[commit, startup]</stat>
        <stat name="replicationEnabled">true</stat>
      </stats>
    </entry>

This is where the problem lies: I need the size of the index. I'm not finding it in the API, nor is the statistics printout (sysout) the same. How do I get the size of the index?

-- Lance Norskog goks...@gmail.com
Re: Minimum Should Match the other way round
Sorry for double-posting, but to avoid any misunderstanding: accessing instantiated filters is not a really good idea, since a new Filter must be instantiated all the time. However, what I meant was: if I create a WordDelimiterFilter or a StopFilter and I have set a param for a file like stopwords.txt or protwords.txt, I want to access those (as I understood, cached) resources. - Mitch
one particular doc in results should always come first for a particular query
Hi, Suppose I search for the word *international*. A particular record (say *recordX*) I am looking for is coming as the Nth result now. I have a requirement that when a user queries for *international* I need recordX to always be the first result. How can I achieve this?

Note: when a user searches with a *different* keyword, *recordX* need not be the expected first result record; it may be a different record that has to be made to come first in the result for that keyword. Is there a way to achieve this requirement? I am using dismax. Thanks in advance. BR, Mark
Re: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
The MailEntityProcessor is an extra and does not come normally with the DataImportHandler. The wiki page should mention this. In the Solr distribution it should be in the dist/ directory as dist/apache-solr-dataimporthandler-extras-1.4.jar. The class it wants is in this jar. (Do 'unzip -l <jar>' to find the classes inside a jar.) You have to make a lib/ directory in the Solr core you are using, and copy this jar into there.

On Mon, Apr 5, 2010 at 1:15 PM, Andrew McCombe eupe...@gmail.com wrote:
Hi, Can no-one help me with this? Andrew

On 2 April 2010 22:24, Andrew McCombe eupe...@gmail.com wrote:
Hi, I am experimenting with Solr to index my gmail and am experiencing an error: 'Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor'. I downloaded a fresh 1.4 tgz, extracted it and added the following to example/solr/config/solrconfig.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
      </lst>
    </requestHandler>

email-data-config.xml contained the following:

    <dataConfig>
      <document name="mailindex">
        <entity processor="MailEntityProcessor" user="eupe...@gmail.com" password="xx" host="imap.gmail.com" protocol="imaps" folders="inbox"/>
      </document>
    </dataConfig>

Whenever I try to import data using /dataimport?command=full-import, the full-import fails with a root cause of:

    Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
        ... 8 more

[full stack trace snipped; see the original post above] Am I missing a step somewhere? I have tried this with the standard apache 1.4, a nightly of 1.5 and also the LucidWorks release and get the same issue with each. The wiki isn't very detailed either. My background isn't in Java, so a lot of this is new to me. Regards, Andrew McCombe

-- Lance Norskog goks...@gmail.com
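A hedged sketch of the steps Lance describes, assuming the stock 1.4 example-core layout (adjust the paths to your own core):

    cd apache-solr-1.4.0/example/solr
    mkdir lib
    cp ../../dist/apache-solr-dataimporthandler-extras-1.4.jar lib/
    # confirm the class really is inside the jar:
    unzip -l lib/apache-solr-dataimporthandler-extras-1.4.jar | grep MailEntityProcessor

Restart Solr afterwards so the core's classloader picks up the new lib/ directory.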
Re: including external files in config by corename
Making snippets is part of highlighting. http://www.lucidimagination.com/search/s:lucid/li:cdrg?q=snippet On Mon, Apr 5, 2010 at 10:53 AM, Shawn Heisey s...@elyograg.org wrote: Is it possible to access the core name in a config file (such as solrconfig.xml) so I can include core-specific configlets into a common config file? I would like to pull in different configurations for things like shards and replication, but have all the cores otherwise use an identical config file. Also, I have been looking for the syntax to include a snippet and haven't turned anything up yet. Thanks, Shawn -- Lance Norskog goks...@gmail.com
Re: no of cfs files are more that the mergeFactor
mergeFactor=5 means that if there are 42 documents, there can be 5 index files: 1 with 25 documents, 3 with 5 documents, and 1 with 2 documents. Imagine making change with coins of 1 document, 5 documents, 5^2 documents, 5^3 documents, etc.

On Mon, Apr 5, 2010 at 10:59 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
This sounds completely normal from what I remember about mergeFactor. Segments are merged by level, meaning that with a mergeFactor of 5, once 5 level-1 segments are formed they are merged into a single level-2 segment. Then 5 more level-1 segments are allowed to form before the next merge (resulting in 2 level-2 segments). Once you have 5 level-2 segments, they are all merged into a single level-3 segment, etc...

: I had my mergeFactor as 5,
: but when i load a data with some 1,00,000 i got some 12 .cfs files in my
: data/index folder.
:
: How come this is possible.
: in what context we can have more no of .cfs files

-Hoss

-- Lance Norskog goks...@gmail.com
exact match coming as second record
Hi, I am using the dismax handler. I have a field named *myfield* which has a value, say XXX.YYY.ZZZ. I have boosted myfield^20.0. Even with such a high boost (in fact, among the qf fields specified, this field has the max boost given), when I search for XXX.YYY.ZZZ I see my record as the second one in the results, and a record of the form XXX.YYY.ZZZ.AAA.BBB appears as the first one. Can anyone help me understand why this is so, as I thought an exact match on a heavily boosted field would give the exact-match record first in dismax. Thanks and Rgds, Mark
Re: one particular doc in results should always come first for a particular query
Hmmm, how do you know which particular record corresponds to which keyword? Is this a list known at index time, as in "this record should come up first whenever bonkers is the keyword"? If that's the case, you could copy the magic keyword to a different field (say magic_keyword) and boost it right into orbit as an OR clause (magic_keyword:bonkers ^1). This kind of assumes that a magic keyword corresponds to one and only one document.

If this is way off base, perhaps you could characterize how keywords map to specific documents you want at the top.

Best
Erick

P.S. It threw me for a minute when you used asterisks (*) for emphasis; they're easily confused with wildcards.

On Mon, Apr 5, 2010 at 5:30 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, Suppose I search for the word *international*. A particular record (say *recordX*) I am looking for is coming as the Nth result now. I have a requirement that when a user queries for *international* I need recordX to always be the first result. How can I achieve this? Note: when a user searches with a *different* keyword, *recordX* need not be the expected first result record; it may be a different record that has to be made to come first in the result for that keyword. Is there a way to achieve this requirement? I am using dismax. Thanks in advance. BR, Mark
Re: exact match coming as second record
What do you get back when you specify debugQuery=on?

Best
Erick

On Mon, Apr 5, 2010 at 7:31 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, I am using the dismax handler. I have a field named *myfield* which has a value, say XXX.YYY.ZZZ. I have boosted myfield^20.0. Even with such a high boost (in fact, among the qf fields specified, this field has the max boost given), when I search for XXX.YYY.ZZZ I see my record as the second one in the results, and a record of the form XXX.YYY.ZZZ.AAA.BBB appears as the first one. Can anyone help me understand why this is so, as I thought an exact match on a heavily boosted field would give the exact-match record first in dismax. Thanks and Rgds, Mark
Re: one particular doc in results should always come first for a particular query
: If that's the case, you could copy the magic keyword to a different field : (say magic_keyword) and boost it right into orbit as an OR clause : (magic_keyword:bonkers ^1). This kind of assumes that a magic keyword : corresponds to one and only one document : : If this is way off base, perhaps you could characterize how keywords map to : specific documents you want at the top. This smells like... http://wiki.apache.org/solr/QueryElevationComponent -Hoss
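To make Hoss's pointer concrete, a hedged sketch of what the elevate.xml entry for Mark's case could look like (the doc id is whatever recordX's uniqueKey value actually is):

    <elevate>
      <query text="international">
        <doc id="recordX" />
      </query>
    </elevate>

With the QueryElevationComponent enabled in solrconfig.xml, the listed document is pinned to the top for exactly that query text, independent of its score; other queries are unaffected, which matches the "different keyword, different record" requirement.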
Re: Multicore and TermVectors
: Subject: Multicore and TermVectors

It doesn't sound like Multicore is your issue ... it seems like what you mean is that you are using distributed search with TermVectors, and that is causing a problem. Can you please clarify exactly what you mean ... describe your exact setup (ie: how many machines, how many solr ports running on each of those machines, what the solr.xml looks like on each of those ports, how many SolrCores running in each of those ports, what the solrconfig.xml looks like for each of those instances, which instances coordinate distributed searches of which shards, what urls your client hits, what URLs get hit on each of your shards (according to the logs) as a result, etc... details, details, details.

-Hoss
Re: Solr caches and nearly static indexes
: times. Is there any way to have the index keep its caches when the only thing
: that happens is deletions, then invalidate them when it's time to actually add
: data? It would have to be something I can dynamically change when switching
: between deletions and the daily import.

The problem is a delete is a genuine change that invalidates the cache objects. The worst case is the QueryResultCache, where a deleted doc would require shifting all of the other docs up in any result set that it matched on -- even if that doc isn't in the actual DocSlice that's cached (ie: the cached version of results 50-100 is affected by deleting a doc from 1-50).

In theory something like the filterCache could be warmed by copying entries from the old cache and just unsetting the bits corresponding to the deleted docs -- except that I'm pretty sure even if all you do is delete some docs, a MergePolicy *could* decide to merge segments and collapse away the docids of the deleted docs.

-Hoss
Re: Solr caches and nearly static indexes
: We had exactly this problem in a consumer app; we had a small but
: continuously growing list of obscene documents in the index, and did
: not want to display these. So, we had a filter query with all of the
: obscene words, and used this with every query.

That doesn't seem like it would really help with the caching issue ... reusing the FieldCache seems like the only thing that would be advantageous in that case; the filterCache and queryResultCache are going to have a low cache hit rate, as the filter queries involved keep changing as new doc keys get added to the filter query. Or am I completely misunderstanding how you had this working?

-Hoss
Re: Solr caches and nearly static indexes
On Mon, Apr 5, 2010 at 9:04 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: ... reusing the FieldCache seems like the only thing that would be
: advantageous in that case

And FieldCache entries are currently reused when there have only been deletions on a segment (since Solr 1.4).

-Yonik
http://www.lucidimagination.com
Re: Solr caches and nearly static indexes
: ... reusing the FieldCache seems like the only thing that would be
: advantageous in that case
:
: And FieldCache entries are currently reused when there have only been
: deletions on a segment (since Solr 1.4).

But that's kind of orthogonal to (what I think) Lance's point was: that instead of deleting docs and opening a new searcher, you could instead just add the doc keys to a (negated) filter query (and never open a new searcher at all).

-Hoss
Re: Solr caches and nearly static indexes
On Mon, Apr 5, 2010 at 9:10 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: ... reusing the FieldCache seems like the only thing that would be
: advantageous in that case
:
: And FieldCache entries are currently reused when there have only been
: deletions on a segment (since Solr 1.4).
:
: But that's kind of orthogonal

Yeah - just coming into the middle and pointing out the FieldCache reuse thing (which is new for 1.4).

: to (what I think) Lance's point was: that instead of deleting docs and
: opening a new searcher, you could instead just add the doc keys to a
: (negated) filter query (and never open a new searcher at all)

I guess as long as you versioned the filter that could work. It would have the effect of invalidating all of the query cache, but wouldn't affect the filter cache.

-Yonik
http://www.lucidimagination.com
Re: exact match coming as second record
Hi Erick,

Thanks many for your mail! Please find attached the debugQuery results.

Thanks!
Mark

On Mon, Apr 5, 2010 at 7:38 PM, Erick Erickson erickerick...@gmail.com wrote:
What do you get back when you specify debugQuery=on? Best, Erick

On Mon, Apr 5, 2010 at 7:31 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, I am using the dismax handler. I have a field named *myfield* which has a value, say XXX.YYY.ZZZ. I have boosted myfield^20.0. Even with such a high boost (in fact, among the qf fields specified, this field has the max boost given), when I search for XXX.YYY.ZZZ I see my record as the second one in the results, and a record of the form XXX.YYY.ZZZ.AAA.BBB appears as the first one. Can anyone help me understand why this is so, as I thought an exact match on a heavily boosted field would give the exact-match record first in dismax. Thanks and Rgds, Mark

A personal note: I have boosted the id field to the highest among my qf values specified in my dismax. Even then, when I search for an id, say XX.YYY.ZZZ, instead of pushing the record with id=XX.YYY.ZZZ to the first place, it displays another record, XX.YYY.ZZZ.ME.PK, as the first one. There are four results in total, but I have included details of only the first and second. I am surprised why XX.YYY.ZZZ doesn't come as the first record even after an exact match is found in it.

My qf fields in dismax:

    <str name="qf">name^10.0 id^20.0 subtopic1^1.0 indicator_value^1.0 country_name^1.0 country_code^1.0 source^0.8 database^1.4 definition^1.2 dr_report_name^1.0 dr_header^1.0 dr_footer^1.0 dr_mdx_query^1.0 dr_reportmetadata^1.0 content^1.0 aag_indicators^1.0 type^1.0 text^.3</str>
    <str name="pf">id^6.0</str>
    <str name="bq">type:Timeseries^1000.0</str>

Debug report:

    <lst name="debug">
    <str name="rawquerystring">xx.yyy.</str>
    <str name="querystring">xx.yyy.</str>
    <str name="parsedquery">+DisjunctionMaxQuery((text:(xx.yyy.zzz xx) yyy ^0.3 | definition:(xx.yyy.zzz xx) yyy ^0.2 | indicator_value:(xx.yyy.zzz xx) yyy | subtopic1:(xx.yyy.zzz xx) yyy | dr_report_name:(xx.yyy.zzz xx) yyy | dr_reportmetadata:(xx.yyy.zzz xx) yyy | dr_footer:(xx.yyy.zzz xx) yyy | type:(xx.yyy.zzz xx) yyy | country_code:(xx.yyy.zzz xx) yyy ^2.0 | country_name:(xx.yyy.zzz xx) yyy ^2.0 | database:(xx.yyy.zzz xx) yyy ^1.4 | aag_indicators:(xx.yyy.zzz xx) yyy | content:(xx.yyy.zzz xx) yyy | id:xx.yyy.^1000.0 | dr_mdx_query:(xx.yyy.zzz xx) yyy | source:(xx.yyy.zzz xx) yyy ^0.2 | name:(xx.yyy.zzz xx) yyy ^10.0 | dr_header:(xx.yyy.zzz xx) yyy )~0.01) DisjunctionMaxQuery((id:xx.yyy.^6.0)~0.01) type:timeseries^1000.0</str>
    <str name="parsedquery_toString">+(text:(xx.yyy.zzz xx) yyy ^0.3 | definition:(xx.yyy.zzz xx) yyy ^0.2 | indicator_value:(xx.yyy.zzz xx) yyy | subtopic1:(xx.yyy.zzz xx) yyy | dr_report_name:(xx.yyy.zzz xx) yyy | dr_reportmetadata:(xx.yyy.zzz xx) yyy | dr_footer:(xx.yyy.zzz xx) yyy | type:(xx.yyy.zzz xx) yyy | country_code:(xx.yyy.zzz xx) yyy ^2.0 | country_name:(xx.yyy.zzz xx) yyy ^2.0 | database:(xx.yyy.zzz xx) yyy ^1.4 | aag_indicators:(xx.yyy.zzz xx) yyy | content:(xx.yyy.zzz xx) yyy | id:xx.yyy.^1000.0 | dr_mdx_query:(xx.yyy.zzz xx) yyy | source:(xx.yyy.zzz xx) yyy ^0.2 | name:(xx.yyy.zzz xx) yyy ^10.0 | dr_header:(xx.yyy.zzz xx) yyy )~0.01 (id:xx.yyy.^6.0)~0.01 type:timeseries^1000.0</str>
    <lst name="explain">
    <str name="XX.YYY..ME.PK">
    0.15786289 = (MATCH) sum of:
      6.086512E-4 = (MATCH) max plus 0.01 times others of:
        6.086512E-4 = (MATCH) weight(text:(xx.yyy. sp) yyy ^0.3 in 1004), product of:
          7.562088E-4 = queryWeight(text:(xx.yyy. xx) yyy ^0.3), product of:
            0.3 = boost
            20.604721 = idf(text:(xx.yyy. xx) yyy ^0.3)
            1.2233584E-4 = queryNorm
          0.8048719 = (MATCH) fieldWeight(text:(xx.yyy. xx) yyy ^0.3 in 1004), product of:
            1.0 = tf(phraseFreq=1.0)
            20.604721 = idf(text:(xx.yyy. xx) yyy ^0.3)
            0.0390625 = fieldNorm(field=text, doc=1004)
      0.15725423 = (MATCH) weight(type:timeseries^1000.0 in 1004), product of:
        0.1387005 = queryWeight(type:timeseries^1000.0), product of:
          1000.0 = boost
          1.1337683 = idf(docFreq=1054, maxDocs=1206)
          1.2233584E-4 = queryNorm
        1.1337683 = (MATCH) fieldWeight(type:timeseries in 1004), product of:
          1.0 = tf(termFreq(type:timeseries)=1)
          1.1337683 = idf(docFreq=1054, maxDocs=1206)
          1.0 = fieldNorm(field=type, doc=1004)
    </str>
    <str name="XX.YYY.">
    0.15774116 = (MATCH) sum of:
      4.8692097E-4 = (MATCH) max plus 0.01 times others of:
        4.8692097E-4 = (MATCH) weight(text:(xx.yyy. xx) yyy ^0.3 in 1003), product of:
          7.562088E-4 = queryWeight(text:(xx.yyy. xx) yyy ^0.3), product of:
            0.3 = boost
            20.604721 =
Re: including external files in config by corename
On 04/05/2010 01:53 PM, Shawn Heisey wrote:
> Is it possible to access the core name in a config file (such as solrconfig.xml) so I can include core-specific configlets into a common config file? I would like to pull in different configurations for things like shards and replication, but have all the cores otherwise use an identical config file. Also, I have been looking for the syntax to include a snippet and haven't turned anything up yet. Thanks, Shawn

The best you have to work with at the moment is XIncludes:

http://wiki.apache.org/solr/SolrConfigXml#XInclude

and system property substitution:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

-- Mark
http://www.lucidimagination.com
Re: Need info on CachedSQLentity processor
On 04/05/2010 02:28 PM, bbarani wrote:
> Hi, I am using CachedSqlEntityProcessor in DIH to index the data. Please find below my dataconfig structure:
>
>     <entity name="x" query="select * from x">                <!-- object -->
>       <entity name="y" query="select * from y"
>               processor="CachedSqlEntityProcessor"
>               cachekey="y.id" cachevalue="x.id"/>            <!-- object properties -->
>     </entity>
>
> For each and every object I would be retrieving corresponding object properties (in my subqueries). I get into OOM very often, and I think that's a trade-off if I use CachedSqlEntityProcessor. My assumption is that when I use CachedSqlEntityProcessor the indexing happens as follows: first, entity x gets executed and the entire table gets stored in cache; next, entity y gets executed and the entire table gets stored in cache; finally, the comparison happens through a hash map. So I always need the memory allocated to the SOLR JVM to be more than or equal to the data present in the tables?
>
> Now my final question is that even after SOLR completes indexing, the memory used previously is not getting released. I could still see the JVM consuming 1.5 GB after the indexing completes. I tried to use Java HotSpot options but didn't see any difference. Any thoughts / confirmation on my assumptions above would be of great help to me in deciding whether to choose CachedSqlEntityProcessor or not. Thanks, BB

You are right - with CachedSqlEntityProcessor the cache is an unbounded HashMap, with no option to bound it. IMO this should be fixed - want to make a JIRA issue? I've brought it up on the list before, but I don't think I ever got around to making an issue.

As to why it's not getting released - that is odd. Perhaps a GC has just not been triggered yet and it will be released? If not, that's a pretty nasty bug. Can you try forcing a GC to see? (Say, with jconsole?)

-- Mark
http://www.lucidimagination.com
Re: including external files in config by corename
: The best you have to work with at the moment is XIncludes:
:
: http://wiki.apache.org/solr/SolrConfigXml#XInclude
:
: and system property substitution:
:
: http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

Except that XInclude is a feature of the XML parser, while property substitution is something Solr does after the XML has been parsed into a DOM -- so you can't have an XInclude of a file whose name is determined by a property (like the core name).

What you can do, however, is have a distinct solrconfig.xml for each core, which is just a thin shell that uses XInclude to include big chunks of frequently reused declarations, and some cores can exclude some of these includes. (ie: turn the problem inside out)

-Hoss
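To illustrate Hoss's inside-out arrangement, a hedged sketch of a per-core thin-shell solrconfig.xml (file names are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <config xmlns:xi="http://www.w3.org/2001/XInclude">
      <!-- declarations shared by every core -->
      <xi:include href="common-config-chunk.xml"/>
      <!-- only cores that act as replication masters include this chunk -->
      <xi:include href="replication-master-chunk.xml"/>
    </config>

Each included chunk must itself be well-formed XML, and the hrefs resolve relative to the including file, so the shared chunks can sit in a common conf directory.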
Re: Some help for folks trying to get new Solr/Lucene up in Eclipse
I had a slight hiccup that I just ignored. Even when I used Java 1.6 JDK mode, Eclipse did not know this method. I had to comment out the three places that use this method:

    javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(true)

Lance Norskog

On Mon, Apr 5, 2010 at 1:49 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
Hey All, Just to save some folks some time in case you are trying to get new Lucene/Solr up and running in Eclipse. If you continue to get weird errors, e.g., in solr/src/test/TestConfig.java regarding org.w3c.dom.Node#getTextContent(), I found that, for me, this error was caused by including Tidy.jar (which includes its own version of the Node API) in the build path. If you take that out, you should be good. Wanted to pass that along. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++

-- Lance Norskog goks...@gmail.com
Re: Need info on CachedSQLentity processor
Mark, I have opened a JIRA issue - https://issues.apache.org/jira/browse/SOLR-1867 Thanks, Barani
Re: Multicore and TermVectors
There is no query parameter. The query parser throws an NPE if there is no query parameter: http://issues.apache.org/jira/browse/SOLR-435

It does not look like term vectors are processed in distributed search anyway.

On Mon, Apr 5, 2010 at 4:45 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: Subject: Multicore and TermVectors
It doesn't sound like Multicore is your issue ... it seems like what you mean is that you are using distributed search with TermVectors, and that is causing a problem. Can you please clarify exactly what you mean ... describe your exact setup (ie: how many machines, how many solr ports running on each of those machines, what the solr.xml looks like on each of those ports, how many SolrCores running in each of those ports, what the solrconfig.xml looks like for each of those instances, which instances coordinate distributed searches of which shards, what urls your client hits, what URLs get hit on each of your shards (according to the logs) as a result, etc... details, details, details.
-Hoss

-- Lance Norskog goks...@gmail.com
Re: including external files in config by corename
On 04/05/2010 10:12 PM, Chris Hostetter wrote:
> Except that XInclude is a feature of the XML parser, while property
> substitution is something Solr does after the XML has been parsed into a
> DOM -- so you can't have an XInclude of a file whose name is determined by
> a property (like the core name)

Didn't suggest he could - just giving him the features he has to work with.

-- Mark
http://www.lucidimagination.com
What does it mean when you see a plus sign in between two words inside synonyms.txt?
Hi, I'm new to this group. I would like to ask a question: what does it mean when you see a plus sign in between two words inside synonyms.txt? e.g.

    macbookair => macbook+air

Thanks, Paulo
Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?
paulosalamat wrote:
> Hi, I'm new to this group. I would like to ask a question: what does it mean
> when you see a plus sign in between two words inside synonyms.txt?
> e.g. macbookair => macbook+air

Welcome, Paulo!

It depends on your tokenizer. You can specify a tokenizer via the tokenizerFactory attribute when you use SynonymFilterFactory. That tokenizer is used when SynonymFilterFactory reads synonyms.txt. If you do not specify it, WhitespaceTokenizer will be used by default. In the above example, the term text "macbookair" will be normalized to the term text "macbook+air" if WhitespaceTokenizer is used.

Koji
-- http://www.rondhuit.com/en/
Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?
Hi Koji, Thank you for the reply. I have another question: if WhitespaceTokenizer is used, is the term text "macbook+air" equal to "macbook air"?

Thank you,
Paulo

On Mon, Apr 5, 2010 at 5:50 PM, Koji Sekiguchi wrote:
> It depends on your tokenizer. You can specify a tokenizer via the
> tokenizerFactory attribute when you use SynonymFilterFactory. That
> tokenizer is used when SynonymFilterFactory reads synonyms.txt. If you
> do not specify it, WhitespaceTokenizer will be used by default. In the
> above example, the term text "macbookair" will be normalized to the
> term text "macbook+air" if WhitespaceTokenizer is used.
>
> Koji
> -- http://www.rondhuit.com/en/
Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?
paulosalamat wrote:
> Hi Koji,
> Thank you for the reply.
> I have another question: if WhitespaceTokenizer is used, is the term text
> "macbook+air" equal to "macbook air"?

No. In the field, "macbook air" will be a phrase (not a term). You can define not only terms but phrases in synonyms.txt:

    ex) macbookair => macbook air

Koji
-- http://www.rondhuit.com/en/
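To make Koji's point concrete, a hedged schema.xml sketch (the field type name is made up); the tokenizerFactory attribute is what controls how synonyms.txt itself gets tokenized:

    <fieldType name="text_syn" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"
                tokenizerFactory="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

With the default WhitespaceTokenizer, "macbook+air" (no whitespace) parses as one term, while "macbook air" parses as a two-term phrase -- which is exactly the distinction Koji describes.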
Re: Obtaining SOLR index size on disk
hi, I am using the piece of code given below:

    ReplicationHandler handler2 = new ReplicationHandler();
    System.out.println(handler2.getDescription());
    NamedList statistics = handler2.getStatistics();
    System.out.println("Statistics " + statistics);

The result that I am getting (i.e., the printed statement) is:

    Statistics {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}

But the statistics consist of the other info too:

    <entry>
      <class>org.apache.solr.handler.ReplicationHandler</class>
      <version>$Revision: 829682 $</version>
      <description>ReplicationHandler provides replication of index and configuration files from Master to Slaves</description>
      <stats>
        <stat name="handlerStart">1270463612968</stat>
        <stat name="requests">0</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">0</stat>
        <stat name="avgTimePerRequest">NaN</stat>
        <stat name="avgRequestsPerSecond">0.0</stat>
        <stat name="indexSize">19.29 KB</stat>
        <stat name="indexVersion">1266984293131</stat>
        <stat name="generation">3</stat>
        <stat name="indexPath">C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index</stat>
        <stat name="isMaster">true</stat>
        <stat name="isSlave">false</stat>
        <stat name="confFilesToReplicate">schema.xml,stopwords.txt,elevate.xml</stat>
        <stat name="replicateAfter">[commit, startup]</stat>
        <stat name="replicationEnabled">true</stat>
      </stats>
    </entry>

This is where the problem lies: I need the size of the index. I'm not finding it in the API, nor is the statistics printout (sysout) the same. How do I get the size of the index?
Re: checking the size of the index using solrj APIs
hi, I am using the piece of code given below:

    ReplicationHandler handler2 = new ReplicationHandler();
    System.out.println(handler2.getDescription());
    NamedList statistics = handler2.getStatistics();
    System.out.println("Statistics " + statistics);

The result that I am getting (i.e., the printed statement) is:

    Statistics {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}

But the statistics consist of the other info too:

    <entry>
      <class>org.apache.solr.handler.ReplicationHandler</class>
      <version>$Revision: 829682 $</version>
      <description>ReplicationHandler provides replication of index and configuration files from Master to Slaves</description>
      <stats>
        <stat name="handlerStart">1270463612968</stat>
        <stat name="requests">0</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">0</stat>
        <stat name="avgTimePerRequest">NaN</stat>
        <stat name="avgRequestsPerSecond">0.0</stat>
        <stat name="indexSize">19.29 KB</stat>
        <stat name="indexVersion">1266984293131</stat>
        <stat name="generation">3</stat>
        <stat name="indexPath">C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index</stat>
        <stat name="isMaster">true</stat>
        <stat name="isSlave">false</stat>
        <stat name="confFilesToReplicate">schema.xml,stopwords.txt,elevate.xml</stat>
        <stat name="replicateAfter">[commit, startup]</stat>
        <stat name="replicationEnabled">true</stat>
      </stats>
    </entry>

This is where the problem lies: I need the size of the index. I'm not finding it in the API, nor is the statistics printout (sysout) the same. How do I get the size of the index?
Re: checking the size of the index using solrj APIs
If you're using ReplicationHandler directly, you already have the xml from which to extract the 'indexSize' attribute. From a client, you can get the indexSize by issuing:

    http://hostname:8983/solr/core/replication?command=details

This will give you an xml response. Use:

    http://hostname:8983/solr/core/replication?command=details&wt=json

to give you a json string that has 'indexSize' within it:

    {"responseHeader":{"status":0,"QTime":0},
     "details":{
       "indexSize":"6.63 KB",
       "indexPath":"usr//bin/solr/core0/index",
       "commits":[
         ["indexVersion",1259974360056,"generation",1572,"filelist",["segments_17o"]],
         ["indexVersion",1259974360057,"generation",1573,"filelist",["segments_17p","_zv.fdx","_zv.fnm","_zv.fdt","_zv.nrm","_zv.tis","_zv.prx","_zv.tii","_zv.frq"]]],
       "isMaster":"true",
       "isSlave":"false",
       "indexVersion":1259974360057,
       "generation":1573,
       "backup":["startTime","Mon Apr 05 14:28:46 BST 2010","fileCount",17,"status","success","snapshotCompletedAt","Mon Apr 05 14:28:47 BST 2010"]},
     "WARNING":"This response format is experimental. It is likely to change in the future."}

Either way, you'll need some sort of parsing logic or formatting to get just the index size bit.
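From SolrJ, a hedged sketch of fetching the same detail programmatically (server URL and core name are placeholders; this simply wraps the HTTP call shown above):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class IndexSizeCheck {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://hostname:8983/solr/core");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "details");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/replication"); // hit the ReplicationHandler, not /select
            NamedList<Object> response = server.request(req);
            NamedList<?> details = (NamedList<?>) response.get("details");
            System.out.println("indexSize: " + details.get("indexSize"));
        }
    }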
Re: Apache Lucene EuroCon Call For Participation: Prague, Czech Republic, May 20-21, 2010
Just a reminder, just over one week left open on the CFP. Some great talks entered already. Keep it up!

On Mar 24, 2010, at 8:03 PM, Grant Ingersoll wrote:

Apache Lucene EuroCon Call For Participation - Prague, Czech Republic, May 20-21, 2010

All submissions must be received by Tuesday, April 13, 2010, 12 Midnight CET / 6 PM US EDT.

The first European conference dedicated to Lucene and Solr is coming to Prague from May 18-21, 2010. Apache Lucene EuroCon is running on a not-for-profit basis, with net proceeds donated back to the Apache Software Foundation. The conference is sponsored by Lucid Imagination with additional support from community and other commercial co-sponsors.

Key Dates:
24 March 2010: Call For Participation Opens
13 April 2010: Call For Participation Closes
16 April 2010: Speaker Acceptance/Rejection Notification
18-19 May 2010: Lucene and Solr Pre-conference Training Sessions
20-21 May 2010: Apache Lucene EuroCon

This conference creates a new opportunity for the Apache Lucene/Solr community and marketplace, providing the chance to gather, learn and collaborate on the latest in Apache Lucene and Solr search technologies and what's happening in the community and ecosystem. There will be two days of Lucene and Solr training offered May 18-19, followed by two days packed with leading-edge Lucene and Solr Open Source Search content and talks by search and open source thought leaders.

We are soliciting 45-minute presentations for the conference, 20-21 May 2010 in Prague. The conference and all presentations will be in English. Topics of interest include:

- Lucene and Solr in the Enterprise (case studies, implementation, return on investment, etc.)
- “How We Did It” Development Case Studies
- Spatial/Geo search
- Lucene and Solr in the Cloud
- Scalability and Performance Tuning
- Large Scale Search
- Real Time Search
- Data Integration/Data Management
- Tika, Nutch and Mahout
- Lucene Connectors Framework
- Faceting and Categorization
- Relevance in Practice
- Lucene and Solr for Mobile Applications
- Multi-language Support
- Indexing and Analysis Techniques
- Advanced Topics in Lucene and Solr Development

All accepted speakers will qualify for discounted conference admission. Financial assistance is available for speakers that qualify.

To submit a 45-minute presentation proposal, please send an email to c...@lucene-eurocon.org containing the following information in plain text:

1. Your full name, title, and organization
2. Contact information, including your address, email, phone number
3. The name of your proposed session (keep your title simple and relevant to the topic)
4. A 75-200 word overview of your presentation (in English); in addition to the topic, describe whether your presentation is intended as a tutorial, description of an implementation, a theoretical/academic discussion, etc.
5. A 100-200-word speaker bio that includes prior conference speaking or related experience (in English)

To be considered, proposals must be received by 12 Midnight CET Tuesday, 13 April 2010 (Tuesday 13 April 6 PM US Eastern time, 3 PM US Pacific time). Please email any questions regarding the conference to i...@lucene-eurocon.org. To be added to the conference mailing list, please email sig...@lucene-eurocon.org.
If your organization is interested in sponsorship opportunities, email spon...@lucene-eurocon.org.

Key Dates:
24 March 2010: Call For Participation Opens
13 April 2010: Call For Participation Closes
16 April 2010: Speaker Acceptance/Rejection Notification
18-19 May 2010: Lucene and Solr Pre-conference Training Sessions
20-21 May 2010: Apache Lucene EuroCon

We look forward to seeing you in Prague!

Grant Ingersoll
Apache Lucene EuroCon Program Chair
www.lucene-eurocon.org
Re: checking the size of the index using solrj APIs
On Fri, Apr 2, 2010 at 7:07 AM, Na_D nabam...@zaloni.com wrote:
> hi,
> I need to monitor the index for the following information:
> 1. Size of the index
> 2. Last time the index was updated.

If by 'size of the index' you mean document count, then check the Luke Request Handler: http://wiki.apache.org/solr/LukeRequestHandler

ryan
Re: add/update document as distinct operations? Is it possible?
Hi, I got the picture now. Not having distinct add/update actions forces me to implement a custom queueing mechanism. Thanks. Cheers.

Erick Erickson wrote:
One of the most requested features in Lucene/SOLR is to be able to update only selected fields rather than the whole document. But that's not how it works at present. An update is really a delete and an add. So, for your second message, you can't do a partial update; you must update the whole document.

I'm a little confused by what you *want* in your first e-mail. But the current way SOLR works, if the SOLR server first received the delete then the update, the index would have the document in it. But the opposite order would delete the document. But this really doesn't sound like a SOLR issue, since SOLR can't magically divine the desired outcome. Somewhere you have to coordinate the requests or your index will not be what you expect. That is, you have to define what rules index modifications follow and enforce them. Perhaps you can consider a queueing mechanism of some sort (that you'd have to implement yourself...)

HTH
Erick

On Thu, Apr 1, 2010 at 1:03 AM, Julian Davchev j...@drun.net wrote:
Hi, I have a distributed messaging solution where I need to distinguish between adding a document and just trying to update it. Scenario:
1. A message is sent for a document to be updated.
2. Meanwhile, another message is sent for the document to be deleted, and is executed before 1.
As a result, when message 1 arrives, instead of ignoring the update (as the document is no more), it will add the document again. From what I see in the manual, I cannot distinguish between those operations. Any pointers? Cheers
Re: add/update document as distinct operations? Is it possible?
Chris, I don't see anything in the headers suggesting that Julian's message was a hijack of another thread.

On Thu, Apr 1, 2010 at 2:17 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: Subject: add/update document as distinct operations? Is it possible?
: References:
: dc9f7963609bed43b1ab02f3ce52863103dc35f...@bene-exch-01.benetech.local
: In-Reply-To:
: dc9f7963609bed43b1ab02f3ce52863103dc35f...@bene-exch-01.benetech.local

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists: When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss

-- "Good Enough" is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: Related terms/combined terms
Not sure of the exact vocabulary I am looking for, so I'll try to explain myself. Given a search term, is there any way to return a list of related/grouped keywords (based on the current state of the index) for that term? For example, say I have a sports catalog and I search for Callaway. Is there anything that could give me back Callaway Driver, Callaway Golf Balls, Callaway Hat, Callaway Glove, since these words are always grouped together/related? Not sure if something like this is even possible. ShingleFilterFactory[1] plus TermsComponent[2] can give you grouped (phrase) keywords. You need to create an extra field (populated via copyField) that constructs shingles (token n-grams). After that you can retrieve the bigram or trigram tokens starting with callaway: solr/terms?terms=true&terms.fl=yourNewField&terms.prefix=Callaway [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory [2] http://wiki.apache.org/solr/TermsComponent
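As an illustration of that suggestion, a minimal schema.xml sketch (the type and field names shingleText, relatedTerms, and productName are invented for this example; pick an analyzer chain that matches your data):

<fieldType name="shingleText" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>

<field name="relatedTerms" type="shingleText" indexed="true" stored="false" multiValued="true"/>
<copyField source="productName" dest="relatedTerms"/>

Note that with the lowercase filter in the chain, the indexed shingles are lowercased, so the terms request should use terms.prefix=callaway rather than Callaway.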
Re: add/update document as distinct operations? Is it possible?
I still don't see what the difference is. If there were distinct add/update operations, how would that absolve you from having to implement your own queueing? To have predictable index content, you still must order your operations. Best Erick On Mon, Apr 5, 2010 at 12:45 PM, Julian Davchev j...@drun.net wrote: Hi, I got the picture now. Not having distinct add/update actions forces me to implement a custom queueing mechanism. Thanks. Cheers. Erick Erickson wrote: One of the most requested features in Lucene/SOLR is the ability to update only selected fields rather than the whole document. But that's not how it works at present. An update is really a delete and an add. So for your second message: you can't do a partial update, you must update the whole document. I'm a little confused by what you *want* in your first e-mail. But the way SOLR currently works, if the SOLR server first received the delete and then the update, the index would have the document in it, while the opposite order would delete the document. But this really doesn't sound like a SOLR issue, since SOLR can't magically divine the desired outcome. Somewhere you have to coordinate the requests or your index will not be what you expect. That is, you have to define what rules index modifications follow and enforce them. Perhaps you can consider a queueing mechanism of some sort (that you'd have to implement yourself...) HTH Erick On Thu, Apr 1, 2010 at 1:03 AM, Julian Davchev j...@drun.net wrote: Hi, I have a distributed messaging solution where I need to distinguish between adding a document and just trying to update it. Scenario: 1. a message is sent for a document to be updated 2. meanwhile another message is sent for the document to be deleted, and is executed before 1. As a result, when 1 arrives, instead of ignoring the update (the document is no more) it will add the document again. From what I see in the manual I cannot distinguish between those operations, which is what would help here. Any pointers? Cheers
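To make the delete-plus-add behavior concrete, a small SolrJ sketch (assuming a 1.4-era CommonsHttpSolrServer and a schema whose uniqueKey is id): re-adding a document with an existing key silently replaces the old version, and any field you do not resend is gone.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteDemo {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42");           // same uniqueKey as a document already in the index
    doc.addField("title", "revised title"); // every field must be resent in full
    server.add(doc);   // internally a delete of doc-42 followed by an add; no partial update
    server.commit();
  }
}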
Re: Minimum Should Match the other way round
On Apr 3, 2010, at 10:18 AM, MitchK wrote: Hello, I want to tinker a little bit with Solr, so I need a little feedback: Is it possible to define a Minimum Should Match for the document itself? I mean, it is possible to say that a query this is my query should only match a document if the document matches 3 of the four queried terms. However, I am searching for a solution that does something like: this is my query, and the document has to consist of this query plus at most, for example, two additional terms. Example: Query: this is my query Doc1: this is my favorite query Doc2: I am searching for a lot of stuff, so this is my query Doc3: I'd like to say: this is my query Saying that at most two additional terms may occur in the document, Solr should return only Doc1. If this is not possible out-of-the-box, I think one has to work with TermVectors, am I right? Not quite following. It sounds like you are saying you want to favor docs that are shorter, while still maximizing the number of terms that match, right? You might look at the Similarity class and the SimilarityFactory as well in the Solr/Lucene code. I think it's possible to do this outside of Lucene/Solr by taking the response of the TermVectorsComponent and filtering the result list. But I'd like to integrate this into Lucene/Solr itself. Any ideas which components I have to customize? At the moment I am speculating that I have to customize the class which is collecting the result, before it is passed to the ResponseWriter. Kind regards - Mitch -- View this message in context: http://n3.nabble.com/Minimum-Should-Match-the-other-way-round-tp694867p694867.html Sent from the Solr - User mailing list archive at Nabble.com.
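Along the lines of the Similarity pointer, a hedged sketch of one way to bias toward shorter documents (this is a soft length penalty, not the hard cutoff Mitch described): override lengthNorm so longer fields score lower than the default 1/sqrt(numTerms).

package com.example.solr;

import org.apache.lucene.search.DefaultSimilarity;

public class ShortDocSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    // Default is 1/sqrt(numTerms); 1/numTerms penalizes each extra term more sharply.
    return numTerms > 0 ? 1.0f / numTerms : 0.0f;
  }
}

It could then be registered in schema.xml via <similarity class="com.example.solr.ShortDocSimilarity"/>.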
Re: Does Lucidimagination search use a multi-facet query filter or sessions?
We are using multiselect facets like what you have below (although I haven't verified your syntax). So no, we are not using sessions. See http://www.lucidimagination.com/search/?q=multiselect+faceting#/s:email for help. -Grant http://www.lucidimagination.com On Apr 1, 2010, at 12:35 PM, bbarani wrote: Hi, I am trying to create search functionality the same as that of the Lucidimagination search. As of now I have formed the facet query as below: http://localhost:8080/solr/db/select?q=*:*&fq={!tag=3DotHierarchyFacet}3DotHierarchyFacet:ABC&facet=on&facet.field={!ex=3DotHierarchyFacet}3DotHierarchyFacet&facet.field=ApplicationStatusFacet&facet.mincount=1 Since I am having multiple facets I have planned to form the query based on the user selection. Something like below... if the user selects (multiple facets) application status as 'P', I would form the query as below: http://localhost:8080/solr/db/select?q=*:*&fq={!tag=3DotHierarchyFacet}3DotHierarchyFacet:NTS&fq={!tag=ApplicationStatusFacet}ApplicationStatusFacet:P&facet=on&facet.field={!ex=3DotHierarchyFacet}3DotHierarchyFacet&facet.field={!ex=ApplicationStatusFacet}ApplicationStatusFacet&facet.mincount=1 Can someone let me know if I am forming the correct query to perform multiselect facets? I just want to know if I am doing anything wrong in the query. We are also trying to achieve this using sessions, but if we are able to solve this by query I would prefer using the query over session variables. Thanks, Barani -- View this message in context: http://n3.nabble.com/Does-Lucidimagination-search-uses-Multi-facet-query-filter-or-uses-session-tp691167p691167.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: feature request for invalid data formats
: : I don't know whether this is the right place to ask, or whether there is a special : tool for issue : requests. We use Jira for bug reports and feature requests, but it's always a good idea to start with a solr-user email before filing a new bug/request, to help discuss the behavior you are seeing. : 2010.03.23. 13:27:23 org.apache.solr.common.SolrException log : SEVERE: java.lang.NumberFormatException: For input string: 1595-1600 : at : java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) : at java.lang.Integer.parseInt(Integer.java:456) : : It would be a great help in some cases if I could know which field contained : the data in the wrong format. you are 100% correct ... can you let us know what the rest of the stack trace is (beyond the last line you posted) so we can figure out exactly where the bug is? : SimplePostTool: FATAL: Solr returned an error: For_input_string_15951600 : __javalangNumberFormatException_For_input_string_15951600 : ___at_javalangNumberFormatExceptionforInputStringNumberFormat : : (I added some line breaks for the sake of readability.) : : Could not a string be returned in the same format as in the Solr log? Solr relies on the servlet container to format the error and return it to the user. With Jetty, the error does actually come back in human-readable form as part of the response body -- what the SimplePostTool is printing there is actually the one-line HTTP response message, which Jetty (in its infinite wisdom) sets using the entire response with the whitespace and newlines escaped. If you use something like curl -D - to hit a Solr URL, you'll see what I mean about the response message vs the response body, and if you use a different servlet container (like Tomcat) you'll see what I mean about the servlet container having control over what the error messages look like. -Hoss
Re: dismax multi search?
: I want to be able to direct some search terms to specific fields : : I want to do something like this : : keyword1 should search against book titles / authors : : keyword2 should search against book contents / book info / user reviews your question is a little vague ... will keyword1 and keyword2 be distinct params (ie: will the user tell you when certain words should be queried against titles/authors and when other keywords should be queried against content/info/reviews) ... or are you going to have big giant word lists, and anytime you see a word from one of those lists, you query a specific field for that word? assuming you mean the first (and not the second) situation, you can use nested query parsers with param substitution to get some interesting results... http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ http://n3.nabble.com/How-to-compose-a-query-from-multiple-HTTP-URL-parameters-td519441.html#a679489 -Hoss
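As a hedged sketch of that nested-parser approach (the parameter names q1/q2 and the field lists are invented for this example; URL-escaping omitted for readability), the request might look like:

q=_query_:"{!dismax qf='title author' v=$q1}" AND _query_:"{!dismax qf='contents bookinfo reviews' v=$q2}"
&q1=keyword1
&q2=keyword2

Each _query_ clause invokes its own dismax parser against a different qf field list, and v=$q1 / v=$q2 dereference the actual user keywords from separate request parameters.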
including external files in config by core name
Is it possible to access the core name in a config file (such as solrconfig.xml) so I can include core-specific configlets into a common config file? I would like to pull in different configurations for things like shards and replication, but have all the cores otherwise use an identical config file. Also, I have been looking for the syntax to include a snippet and haven't turned anything up yet. Thanks, Shawn
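One building block that may help, independent of the include question: solrconfig.xml supports property substitution, and each core gets implicit properties such as ${solr.core.name}. A sketch of per-core replication config living in a shared file (the master host name is invented for this example):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/${solr.core.name}/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

Per-core <property name="..." value="..."/> entries declared in solr.xml can be substituted the same way.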
Re: Related terms/combined terms
Thanks for the response Mitch. I'm not too sure how well this will work for my needs, but I'll certainly play around with it. I think something more along the lines of Ahmet's solution is what I was looking for. -- View this message in context: http://n3.nabble.com/Related-terms-combined-terms-tp694083p698327.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: no of cfs files are more than the mergeFactor
This sounds completely normal from what I remember about mergeFactor. Segments are merged by level, meaning that with a mergeFactor of 5, once 5 level-1 segments are formed they are merged into a single level-2 segment. Then 5 more level-1 segments are allowed to form before the next merge (resulting in 2 level-2 segments). Once you have 5 level-2 segments, they are all merged into a single level-3 segment, etc... : I had my mergeFactor as 5 , : but when i load a data with some 1,00,000 i got some 12 .cfs files in my : data/index folder . : : How come this is possible . : in what context we can have more no of .cfs files -Hoss
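To put numbers on that: with mergeFactor=5, up to 4 finished segments can sit at each level while a 5th is still accumulating, so something like 4 level-1 + 4 level-2 + 4 level-3 segments, i.e. 12 .cfs files, is perfectly normal.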
Re: no of cfs files are more than the mergeFactor
I'm guessing the user is expecting there to be one cfs file for the whole index, and does not understand that it's actually per segment. On 04/05/2010 01:59 PM, Chris Hostetter wrote: This sounds completely normal from what I remember about mergeFactor. Segments are merged by level, meaning that with a mergeFactor of 5, once 5 level-1 segments are formed they are merged into a single level-2 segment. Then 5 more level-1 segments are allowed to form before the next merge (resulting in 2 level-2 segments). Once you have 5 level-2 segments, they are all merged into a single level-3 segment, etc... : I had my mergeFactor as 5 , : but when i load a data with some 1,00,000 i got some 12 .cfs files in my : data/index folder . : : How come this is possible . : in what context we can have more no of .cfs files -Hoss -- - Mark http://www.lucidimagination.com
Re: Getting solr response in HTML format : HTMLResponseWriter
: so I have tried to attach the xslt stylesheet to the response of SOLR by : passing these 2 variables: wt=xslt&tr=example.xsl : : while example.xsl is a stylesheet included with SOLR, but the response in : HTML wasn't very perfect. can you elaborate on what you mean by wasn't very perfect? ... what was wrong with it? ... was there an actual bug, or were you just not happy with how it looked? did you try modifying the example.xsl? (it's intended purely as an example ... it's not meant to work for everyone as is) : So i have read on the net that we can write an extension to the : QueryResponseWriter class like XMLResponseWriter (the default) : and i am trying to build that . ... : I am proceeding like XMLResponseWriter to create HTMLResponseWriter and i I would strongly suggest that instead of doing this, you take a look at the velocity response writer (in contrib) or tweak the XSL some more ... writing a custom HTMLResponseWriter isn't nearly as flexible as either of those other two options -- particularly because the ResponseWriter API requires you to deal with the Response objects in the order they are added by the RequestHandler -- which isn't necessarily the same order you want to deal with them in an HTML response. (this isn't typically a problem for most ResponseWriters because they aren't typically intended to be read by humans) : org.apache.solr.common.SolrException: Error loading class : 'org.apache.solr.request.HTMLResponseWriter' 1) if you are writing a custom ResponseWriter, you should be using your own java package name, not org.apache.solr.request : Caused by: java.lang.ClassNotFoundException: : org.apache.solr.request.HTMLResponseWriter 2) it can't find your class. did you compile it? did you put it in a jar? where did you put the jar? what does your solr install look like? ... the details are the key to understanding why it can't find your class. -Hoss
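For anyone determined to write one anyway despite the caveats above, a minimal hedged sketch against the 1.4-era API (note the custom package name, per point 1 above; the class must be compiled, jarred, and dropped into the core's lib directory):

package com.example.solr;

import java.io.IOException;
import java.io.Writer;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.QueryResponseWriter;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HTMLResponseWriter implements QueryResponseWriter {
  public void init(NamedList args) {}

  public String getContentType(SolrQueryRequest request, SolrQueryResponse response) {
    return "text/html;charset=UTF-8";
  }

  public void write(Writer writer, SolrQueryRequest request, SolrQueryResponse response)
      throws IOException {
    // Crude unescaped dump of the response NamedList; real HTML generation goes here.
    writer.write("<html><body><pre>");
    writer.write(String.valueOf(response.getValues()));
    writer.write("</pre></body></html>");
  }
}

It would then be registered in solrconfig.xml with <queryResponseWriter name="html" class="com.example.solr.HTMLResponseWriter"/> and selected with wt=html.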
Re: exception handling / error reporting?
: This client uses a simple user-agent that requires JSON syntax while parsing : search results from solr, but when solr throws an exception, tomcat returns an : error-500 page to the client and it crashes. define crashes? ... presumably you are talking about the client crashing because it can't parse the error response, correct? ... the best suggestion given the current state of Solr is to make the client smart enough to not attempt parsing of the response unless the response code is 200. : I was wondering if there's already a way to prepare exceptions as error reports : and integrate them into the search result as a hint to the user? If it were : just another element of the whole response format, it would potentially be : compatible with any client out there. It's one of the oldest outstanding improvements in the Solr issue tracker, but it hasn't gotten much love over the years... https://issues.apache.org/jira/browse/SOLR-141 One possible workaround, if you are comfortable with Java and if you are willing to always get the errors in a single response format (ie: JSON)... you can customize the solr.war to specify an error jsp that your servlet container will use to format all error responses. you can make that JSP extract the error message from the Exception and output it in JSON format. -Hoss
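A hedged sketch of that error-jsp workaround (file names invented; the exact behavior of the implicit exception object varies by servlet container): map 500 errors to a JSP in web.xml,

<error-page>
  <error-code>500</error-code>
  <location>/error.jsp</location>
</error-page>

and have error.jsp emit JSON instead of HTML:

<%@ page isErrorPage="true" contentType="application/json" %>
<%
  // Quotes swapped for apostrophes as crude JSON escaping; a real page needs full escaping.
  String msg = (exception == null || exception.getMessage() == null)
      ? "unknown error" : exception.getMessage().replace('"', '\'');
%>
{"error": "<%= msg %>"}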
Re: Is this a bug of the RessourceLoader?
: Some applications (such as Windows Notepad) insert a UTF-8 Byte Order Mark : (BOM) as the first character of the file. So, perhaps the first word in your : stopwords list contains a UTF-8 BOM and that's why you are seeing this : behavior. Robert: BOMs are one of those things that strike me as being abhorrent and inherently evil because they seem to cause nothing but problems -- but in truth I understand very little about them and have no idea if/when they actually add value. If text files that start with a BOM aren't properly being dealt with by Solr right now, should we consider that a bug? Is there something we can/should be doing in SolrResourceLoader to make Solr handle this situation better? -Hoss
Re: selecting documents older than 4 hours
: NOW/HOUR-5HOURS evaluates to 2010-03-31T21:00:00 which should not be the : case if the current time is Wed Mar 31 19:50:48 PDT 2010. Is SOLR converting : NOW to GMT time? 1) NOW means Now ... what moment in time is happening right now is independent of what locale you are in and how you want to format that moment to represent it as a string. 2) Solr always parses/formats date-time values in UTC because Solr has no way of knowing what timezone the clients are in (or if some clients are in different timezones from each other, or if the index is being replicated from a server in one timezone to a server in a different timezone, etc...). The documentation for DateField is very explicit about this (it's why the trailing Z is mandatory). 3) Rounding is always done relative to UTC, largely for all of the same reasons listed above. If you want a specific offset you have to add it in using the DateMath syntax, ie... last_update_date:[NOW/DAY-7DAYS+8HOURS TO NOW/HOUR-5HOURS+8HOURS] -Hoss
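Concretely, for the times quoted above: Wed Mar 31 19:50:48 PDT 2010 is 2010-04-01T02:50:48Z in UTC, NOW/HOUR rounds that down to 2010-04-01T02:00:00Z, and subtracting 5HOURS gives 2010-03-31T21:00:00Z, exactly the value observed. Nothing is being converted; the math simply happens in UTC.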
Re: Is this a bug of the RessourceLoader?
On Mon, Apr 5, 2010 at 2:28 PM, Chris Hostetter hossman_luc...@fucit.org wrote: If text files that start with a BOM aren't properly being dealt with by Solr right now, should we consider that a bug? It's a Java bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 But we should fix it if it's practical to do so, rather than passing the buck. -Yonik http://www.lucidimagination.com
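For reference, a minimal sketch of the kind of first-line workaround discussed here (a standalone illustration, not the actual SolrResourceLoader patch): the UTF-8 BOM decodes to U+FEFF, which Java's UTF-8 decoder leaves in place, so it can be stripped from the first line read.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class BomTolerantReader {
  public static List<String> readLines(InputStream in) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    List<String> lines = new ArrayList<String>();
    String line;
    boolean first = true;
    while ((line = reader.readLine()) != null) {
      if (first && line.length() > 0 && line.charAt(0) == '\uFEFF') {
        line = line.substring(1);  // drop the BOM the decoder left in place (Sun bug 4508058)
      }
      first = false;
      lines.add(line);
    }
    return lines;
  }
}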