Re: solr reporting tool adapter

2009-10-07 Thread Shalin Shekhar Mangar
On Tue, Oct 6, 2009 at 1:09 PM, Rakhi Khatwani rkhatw...@gmail.com wrote:

 Hi,
I wanted to query Solr and send the output to some reporting tool. Has
 anyone done something like that? Moreover, which reporting tool is good?
 Any suggestions?


Can you be more specific on what you want to achieve? What kind of reports
are you looking for?

-- 
Regards,
Shalin Shekhar Mangar.


Re: solr optimize - no space left on device

2009-10-07 Thread Shalin Shekhar Mangar
Not sure but a quick search turned up:
http://www.walkernews.net/2007/07/13/df-and-du-command-show-different-used-disk-space/

Using up to 2x the index size can happen. Also check if there is a
snapshooter script running through cron that is making hard links to files
while a merge is in progress.

Do let us know if you make any progress. This is interesting.

On Tue, Oct 6, 2009 at 5:28 PM, Phillip Farber pfar...@umich.edu wrote:

 I am attempting to optimize a large shard on Solr 1.4 and repeatedly get
 java.io.IOException: No space left on device. The shard, after a final
 commit before optimize, shows a size of about 192GB on a 400GB volume. I
 have successfully optimized 2 other shards that were similarly large without
 this problem on identical hardware boxes.

 Before the optimize I see:

 % df -B1 .
 Filesystem 1B-blocks Used Available Use% Mounted on
 /dev/mapper/internal-solr--build--2
 435440427008 205681356800 225335255040 48%
 /l/solrs/build-2

 slurm-4:/l/solrs/build-2/data/index % du -B1
 205441486848 .

 There's a slight discrepancy between the du and df, which appears to be
 orphaned inodes. But the du says there should be enough space to handle the
 doubling in size during optimization. However, for the second time we ran
 out of space, and at that point du and df are wildly different and the
 volume is at 100%:


 % df -B1 .

 Filesystem   1B-blocks  Used Available Use% Mounted on
 /dev/mapper/internal-solr--build--2
435440427008 430985760768  30851072 100%
 /l/solrs/build-2

 slurm-4:/l/solrs/build-2/data/index % du -B1
 252552298496 .

 At this point it appears orphaned inodes are consuming space and not being
 freed up. Any clue as to whether this is a Lucene bug, a Solr bug, or some
 other problem? Error traces follow.

 Thanks!

 Phil

 ---

 Oct 6, 2009 2:12:37 AM org.apache.solr.update.processor.LogUpdateProcessor
 finish
 INFO: {} 0 9110523
 Oct 6, 2009 2:12:37 AM org.apache.solr.common.SolrException log
 SEVERE: java.io.IOException: background merge hit exception: _ojl:C151080
 _169w:C141302 _1j36:C80405 _1j35:C2043 _1j34:C192 into _1j37 [optimize]
   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2737)
   at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2658)
   at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:401)
   at
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:168)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
   at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
   at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
   at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
   at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
   at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
   at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
   at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
   at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
   at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
   at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
   at
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
   at
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
   at
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
   at
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
   at java.lang.Thread.run(Thread.java:619)
 Caused by: java.io.IOException: No space left on device
   at java.io.RandomAccessFile.writeBytes(Native Method)
   at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
   at
 org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:719)
   at
 org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
   at
 org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
   at
 org.apache.lucene.store.BufferedIndexOutput.seek(BufferedIndexOutput.java:124)
   at
 

datadir configuration

2009-10-07 Thread clico

Hello,
As I try to deploy my app on a Tomcat server, I'd like to customize the
dataDir variable outside the solrconfig.xml file.

Is there a way to customize it in a context file?

Thanks
-- 
View this message in context: 
http://www.nabble.com/datadir-configuration-tp25782469p25782469.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: datadir configuration

2009-10-07 Thread Gasol Wu
Hi,

Add a JAVA_OPTS variable in TOMCAT_HOME/bin/catalina.sh like below:
JAVA_OPTS="$JAVA_OPTS -Dsolr.home=/opt/solr -Dsolr.foo.data.dir=/opt/solr/data"

The solr.foo.data.dir property must map to the dataDir in solrconfig.xml.

Here is an example (solrconfig.xml):
<dataDir>${solr.foo.data.dir:/default/path/to/datadir}</dataDir>

On Wed, Oct 7, 2009 at 4:27 PM, clico cl...@mairie-marseille.fr wrote:


 Hello,
 As I try to deploy my app on a Tomcat server, I'd like to customize the
 dataDir variable outside the solrconfig.xml file.

 Is there a way to customize it in a context file?

 Thanks
 --
 View this message in context:
 http://www.nabble.com/datadir-configuration-tp25782469p25782469.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Doing SpellCheck in distributed search

2009-10-07 Thread balaji.a

Hi All,
   I am trying to get spell check suggestions in my distributed search query
using shards. I have 2 cores configured, core0 and core1, both having the
spell check component configured. On requesting search results using the
following query, I don't get the spelling suggestions:

http://localhost:8080/solr/core0/select?spellcheck=true&q=BrekFast&shards=localhost:8080/solr/core0,localhost:8080/solr/core1

But I am able to get suggestions when I query a single core using the URL
given below:

http://localhost:8080/solr/core0/select?spellcheck=true&q=BrekFast

On debugging the code (Solr 1.3) I can see suggestions coming from core0,
but while merging the results the suggestion value is getting lost. I am not
sure if it is a bug in the code or an enhancement for a future release. Could
anyone guide me on how to achieve spellcheck over multiple cores?

Thanks!
-- 
View this message in context: 
http://www.nabble.com/Doing-SpellCheck-in-distributed-search-tp25782755p25782755.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: ISOLatin1AccentFilter before or after Snowball?

2009-10-07 Thread Shalin Shekhar Mangar
On Tue, Oct 6, 2009 at 4:33 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Hi all,

 from reading through previous posts on that subject, it seems like the
 accent filter has to come before the snowball filter.

 I'd just like to make sure this is so. If it is the case, I'm wondering
 whether snowball filters for, e.g., French process accented language
 correctly at all, or whether they remove accents anyway... Or whether
 accents should be removed whenever making use of snowball filters.


I'd think so but I'm not sure. Perhaps someone else can weigh in.



 And also: it really is meant to take UTF-8 as input, even though it is
 named ISOLatin1AccentFilter, isn't it?


See http://markmail.org/message/hi25u5iqusfu542b

-- 
Regards,
Shalin Shekhar Mangar.


Re: Questions about synonyms and highlighting

2009-10-07 Thread Shalin Shekhar Mangar
I'm not an expert on hit highlighting but please find some answers inline:

On Wed, Sep 30, 2009 at 9:03 PM, Nourredine K. nourredin...@yahoo.com wrote:

 Hi,

 Can you please give me some answers for those questions :

 1 - How can I get synonyms found for  a keyword ?

 I mean i search foo and i have in my synonyms.txt file the following
 tokens : foo, foobar, fee (with expand = true)
 My index contains foo and foobar. I want to display a message in a
 result page, on the header for example, only the 2 matched tokens and not
 fee  like Results found for foo and foobar


Whatever token is available in the index will be matched, but I don't think
it is possible to show only those synonyms which matched some documents.
Adding debugQuery=on can give you some more information, like how the score
for a particular document was calculated for the given query.


 2 - Can solR make analysis on an index to extract associations between
 tokens ?

 for example , if foo often appears with fee in a field, it will
 associate the 2 tokens.


Solr won't compute associations but there are ways of achieving something
similar. For example, the MoreLikeThis functionality clusters related
documents through co-occurrence of terms in a given field. Also, the
TermVectorComponent can give you position information for terms in a
document. You can use that to build your own co-occurrence associations.

If you just want to query for two words within a fixed position difference,
you can do proximity matches.

http://lucene.apache.org/java/2_9_0/queryparsersyntax.html#Proximity%20Searches
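
For example (field name illustrative), a query like text:"foo fee"~10 matches
documents where foo and fee occur within ten positions of each other.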

Perhaps somebody else can weigh in on your questions #3 and #4.

-- 
Regards,
Shalin Shekhar Mangar.


Re: solr reporting tool adapter

2009-10-07 Thread Rakhi Khatwani
We basically want to generate PDF reports which contain tag clouds, bar
charts, pie charts, etc.
Regards,
Raakhi

On Wed, Oct 7, 2009 at 1:28 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Tue, Oct 6, 2009 at 1:09 PM, Rakhi Khatwani rkhatw...@gmail.com
 wrote:

  Hi,
 I wanted to query Solr and send the output to some reporting tool. Has
  anyone done something like that? Moreover, which reporting tool is
 good? Any suggestions?
 
 
 Can you be more specific on what you want to achieve? What kind of reports
 are you looking for?

 --
 Regards,
 Shalin Shekhar Mangar.



Re: datadir configuration

2009-10-07 Thread clico

What do I put in
<dataDir>${solr.foo.data.dir:/default/path/to/datadir}</dataDir>
?

What is /default/path/to/datadir?



Gasol Wu wrote:
 
 Hi,
 
 Add a JAVA_OPTS variable in TOMCAT_HOME/bin/catalina.sh like below:
 JAVA_OPTS="$JAVA_OPTS -Dsolr.home=/opt/solr -Dsolr.foo.data.dir=/opt/solr/data"

 The solr.foo.data.dir property must map to the dataDir in solrconfig.xml.

 Here is an example (solrconfig.xml):
 <dataDir>${solr.foo.data.dir:/default/path/to/datadir}</dataDir>
 
 On Wed, Oct 7, 2009 at 4:27 PM, clico cl...@mairie-marseille.fr wrote:
 

 Hello,
 As I try to deploy my app on a Tomcat server, I'd like to customize the
 dataDir variable outside the solrconfig.xml file.

 Is there a way to customize it in a context file?

 Thanks
 --
 View this message in context:
 http://www.nabble.com/datadir-configuration-tp25782469p25782469.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/datadir-configuration-tp25782469p25783320.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Timeouts

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 2:19 AM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:


 What does the maxCommitsToKeep (from SolrDeletionPolicy in solrconfig.xml)
 parameter actually do? Increasing this value seems to have helped a little,
 but I'm wary of cranking it up without having a better understanding of what
 it does.


maxCommitsToKeep is the number of commit points (point-in-time snapshots of
the index) to keep from getting deleted. But deletion of commit points only
happens on startup or when someone calls commit/optimize.
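
For reference, the relevant solrconfig.xml block looks roughly like this
(values are illustrative, not recommendations):

<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>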

-- 
Regards,
Shalin Shekhar Mangar.


Re : Questions about synonyms and highlighting

2009-10-07 Thread Nourredine K.
 I'm not an expert on hit highlighting but please find some answers inline:

Thanks Shalin for your answers. It helps a lot.

I post again questions #3 and #4 for the others :)


3 - Is it possible, and if so how, can I configure Solr to enable or disable
highlighting for tokens with diacritics?

Settings for "vélo" (all highlighted) ==> the two words <em>vélo</em> and
<em>velo</em> are highlighted
Settings for "vélo" ==> the first word <em>vélo</em> is highlighted but not
the second: velo


4 - The same question for highlighting with lemmatisation?

Settings for "manage" (all highlighted) ==> the two words <em>manage</em> and
<em>management</em> are highlighted
Settings for "manage" ==> the first word <em>manage</em> is highlighted but
not the second: management

Regards,

Nourredine.


  

Re: Indexing and searching of sharded/ partitioned databases and tables

2009-10-07 Thread Shalin Shekhar Mangar
Comments inline:

On Wed, Oct 7, 2009 at 2:01 PM, Jayant Kumar Gandhi jaya...@gmail.com wrote:


 Let's say I have 3 MySQL databases, each with 3 tables.

 Db1 : Tbl1, Tbl2, Tbl3
 Db2 : Tbl1, Tbl2, Tbl3
 Db3 : Tbl1, Tbl2, Tbl3

 All databases have the same number of tables with same table names as
 shown above. All tables have exactly the same structure as well. Each
 table has three fields:
 id, name, category

 Since the data is distributed this way, I don't have a way to search
 for a particular record using 'name'. I must look for it in all 9
 tables. This is not scalable when, let's say, I have 20 databases each
 with 20 tables, meaning 400 queries are needed to find a single record.

 Solr seemed like the solution to help.

 I followed the wiki tutorials:
 http://wiki.apache.org/solr/DataImportHandler
 http://wiki.apache.org/solr/DIHQuickStart
 http://wiki.apache.org/solr/DataImportHandlerFaq

 The following are my config files so far:
 
 solrconfig.xml
 
 <requestHandler name="/dataimport"
     class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="config">data-config.xml</str>
   </lst>
 </requestHandler>

 
 dataconfig.xml (so far)
 
 <dataConfig>
   <dataSource type="JdbcDataSource" name="ds1"
       driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db1"
       user="user-name" password="password" />
   <dataSource type="JdbcDataSource" name="ds2"
       driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db2"
       user="user-name" password="password" />
   <dataSource type="JdbcDataSource" name="ds3"
       driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db3"
       user="user-name" password="password" />
   <document>
     <entity name="record11" dataSource="ds1" query="select id,name,category from Tbl1"></entity>
     <entity name="record12" dataSource="ds1" query="select id,name,category from Tbl2"></entity>
     <entity name="record13" dataSource="ds1" query="select id,name,category from Tbl3"></entity>
     <entity name="record21" dataSource="ds2" query="select id,name,category from Tbl1"></entity>
     <entity name="record22" dataSource="ds2" query="select id,name,category from Tbl2"></entity>
     <entity name="record23" dataSource="ds2" query="select id,name,category from Tbl3"></entity>
     <entity name="record31" dataSource="ds3" query="select id,name,category from Tbl1"></entity>
     <entity name="record32" dataSource="ds3" query="select id,name,category from Tbl2"></entity>
     <entity name="record33" dataSource="ds3" query="select id,name,category from Tbl3"></entity>
   </document>
 </dataConfig>

 
 Doubts/ Questions:
 

 - Is this the right way to index this data?
 - Is there a better way to achieve this? Imagine 20 databases with 20
 tables each: that translates to 400 lines in the XML. This doesn't scale
 for something like 200 databases and 200 tables each. Will Solr continue
 to work/index properly if I had 400 <entity> rows, without going out
 of memory?


Seems OK. Your original database is sharded so I'm guessing the amount of
data is quite large. The number of root entities does not matter. What
matters is the total number of documents. As you go from indexing 20
database shards to 200 shards, you will likely cross a point where indexing
all of them on a single Solr box is either impossible (due to the large
number of documents) or very slow. Similarly, response times may also
suffer.

Solr supports distributed search wherein you can shard your Solr index, each
shard having a disjoint set of documents. You can continue to query Solr
normally (except for providing an additional shards request parameter) and
Solr will make sure it gets results from all shards, merges them, and returns
them as if you were querying a single Solr instance.

See http://wiki.apache.org/solr/DistributedSearch for more details.
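
For example, a sketch with two hypothetical hosts:

http://host1:8983/solr/select?q=name:foo&shards=host1:8983/solr,host2:8983/solr

The instance receiving the request fans it out to both shards and merges the
results.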


 - I really want to be able to search through the complete database for
 a 'name' and do things like 'category' filtering easily, independent
 of the entity name/datasource. For me they are all records of the
 same type.


That is very much possible out of the box.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Doing SpellCheck in distributed search

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 2:14 PM, balaji.a reachbalaj...@gmail.com wrote:


 Hi All,
   I am trying to get spell check suggestions in my distributed search query
 using shards.


SpellCheckComponent does not support distributed search yet. There is an
issue open with a patch. If you decide to use, do let us know your feedback:

https://issues.apache.org/jira/browse/SOLR-785

-- 
Regards,
Shalin Shekhar Mangar.


Re: solr reporting tool adapter

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 2:51 PM, Rakhi Khatwani rkhatw...@gmail.com wrote:

 We basically want to generate PDF reports which contain tag clouds, bar
 charts, pie charts, etc.


Faceting on a field will give you top terms and frequency information, which
can be used to create tag clouds. What do you want to plot on a bar chart?
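
For example, a rough sketch (field name assumed):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=category&facet.limit=30

returns the top 30 terms of the category field with their document counts,
which is essentially the data behind a tag cloud.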

I don't know of a reporting tool which can hook into Solr for creating such
things.

-- 
Regards,
Shalin Shekhar Mangar.


Re: datadir configuration

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 2:56 PM, clico cl...@mairie-marseille.fr wrote:


 What do I put in
 <dataDir>${solr.foo.data.dir:/default/path/to/datadir}</dataDir>
 ?

 What is /default/path/to/datadir?


Solr variables are written like:

${variable_name:default_value}

If you are configuring the dataDir as an environment variable, you can
remove the default value.
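
For example, a sketch (property name assumed):

<dataDir>${solr.data.dir}</dataDir>

in solrconfig.xml, combined with -Dsolr.data.dir=/opt/solr/data in JAVA_OPTS
when starting Tomcat.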

-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Quries

2009-10-07 Thread Shalin Shekhar Mangar
First, please do not cross-post messages to both solr-dev and solr-user.
Solr-dev is only for development related discussions.

Comments inline:

On Wed, Oct 7, 2009 at 9:59 AM, Pravin Karne
pravin_ka...@persistent.co.in wrote:

 Hi,
 I am new to solr. I have following queries :


 1.   Does Solr work in a distributed environment? If yes, how do I
 configure it?


Yes, Solr works in a distributed environment. See
http://wiki.apache.org/solr/DistributedSearch





 2.   Does Solr have Hadoop support? If yes, how do I set it up with
 Hadoop/HDFS? (Note: I am familiar with Hadoop)


Not currently. There is some work going on at
https://issues.apache.org/jira/browse/SOLR-1457




 3.   I have employee information (id, name, address, cell no, personal
 info) of 1 TB. To post (index) this data to the Solr server, do I have to
 create an XML file with this data and then post it to the Solr server? Or is
 there any other optimal way? In future my data will grow up to 10 TB; how
 can I index this data then? (Creating XML is more of a headache.)


XML is just one way. You could also use CSV. If you use the Solrj Java
client with Solr 1.4 (soon to be released), it uses an efficient binary
format for posting data to Solr.
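
For example, a minimal Solrj sketch (Solr 1.4 era API; the URL and field
names are hypothetical):

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexEmployee {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Post documents in the binary (javabin) format instead of XML
    server.setRequestWriter(new BinaryRequestWriter());

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "emp-1");
    doc.addField("name", "John Doe");
    doc.addField("address", "123 Main St");
    server.add(doc);
    server.commit();
  }
}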

-- 
Regards,
Shalin Shekhar Mangar.


SpellCheck with filter/conditions

2009-10-07 Thread R. Tan
Sorry, newbie here, figured it out.

How do you get spelling suggestions on a specific result set, filtered by a
certain facet, for example?


On Wed, Oct 7, 2009 at 8:43 AM, R. Tan tanrihae...@gmail.com wrote:

 Nice. In comparison, how do you do it with faceting?

 Two other approaches are to use either the TermsComponent (new in Solr
 1.4) or faceting.



 On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill jayallenh...@gmail.com wrote:

 Have a look at a blog I posted on how to use EdgeNGrams to build an
 auto-suggest tool:

 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

 You could easily add filter queries to this approach. For example, the
 query used in the blog could add filter queries like this:

 http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count+desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery

 -Jay
 http://www.lucidimagination.com




 On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote:

  Hello,
  What's the best way to get auto-suggested terms/keywords that are
  filtered by one or more fields? TermsComponent should have been the
  solution but filters are not supported.
 
  Thanks,
  Rihaed
 





Re: Re : Questions about synonyms and highlighting

2009-10-07 Thread Avlesh Singh

 4 - The same question for highlighting with lemmatisation?
 Settings for "manage" (all highlighted) ==> the two words <em>manage</em>
 and <em>management</em> are highlighted
 Settings for "manage" ==> the first word <em>manage</em> is highlighted
 but not the second: management


There is no lemmatisation support in Solr as of now. The only support you
get is stemming.
Let me understand this correctly - you basically want the searches to happen
on the stemmed base but want to selectively highlight the original and/or
stemmed words. Right? If yes, then AFAIK, this is not possible. Search
passes through your field's analyzers (tokenizers and filters). Highlighters,
typically, use the same set of analyzers, and the behavior will be the same
as in search; this essentially means that the keywords manage, managing,
management and manager are REDUCED to manage for searching and
highlighting.
If this can be done, then the only place to enable your feature would be the
Lucene highlighter APIs. Someone more knowledgeable can tell you if that
is possible.

I have no idea about your #3, though my way of handling accentuation is to
apply an ISOLatin1AccentFilterFactory and get rid of accents altogether :)
I am curious to know the answer, though.

Cheers
Avlesh

On Wed, Oct 7, 2009 at 3:17 PM, Nourredine K. nourredin...@yahoo.com wrote:

  I'm not an expert on hit highlighting but please find some answers
 inline:

 Thanks Shalin for your answers. It helps a lot.

 I post again questions #3 and #4 for the others :)


 3 - Is it possible, and if so how, can I configure Solr to enable or
 disable highlighting for tokens with diacritics?


 Settings for "vélo" (all highlighted) ==> the two words <em>vélo</em> and
 <em>velo</em> are highlighted
 Settings for "vélo" ==> the first word <em>vélo</em> is highlighted but
 not the second: velo


 4 - The same question for highlighting with lemmatisation?


 Settings for "manage" (all highlighted) ==> the two words <em>manage</em>
 and <em>management</em> are highlighted
 Settings for "manage" ==> the first word <em>manage</em> is highlighted
 but not the second: management

 Regards,

 Nourredine.





Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-07 Thread Mint Ekalak

I ran Solr successfully until I updated recently; now the import dies at this
line from data-import.xml: ImportTime > '${dataimporter.last_index_time}'

I got this error:

org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: select * from newheader where ImportTime > 'Wed Oct 07
20:17:05 EST 2009'  Processing Document # 1
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextModifiedRowKey(SqlEntityProcessor.java:81)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextModifiedRowKey(EntityProcessorWrapper.java:251)
at
org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:621)
at
org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:259)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:173)
at
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:352)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Conversion
failed when converting date and/or time from character string.
at
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:196)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1458)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:733)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:631)
at
com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4016)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1414)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:176)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:151)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.execute(SQLServerStatement.java:604)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:246)
... 11 more



Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
 
 Really?
 I don't remember that being changed.

 What difference do you notice?
 
 On Wed, Oct 7, 2009 at 2:30 AM, michael8 mich...@saracatech.com wrote:

 Just looking for confirmation from others, but it appears that the
 formatting of last_index_time from dataimport.properties (using
 DataImportHandler) is different in 1.4 vs. that in 1.3.  I was
 troubleshooting why delta imports are no longer working for me after
 moving over to Solr 1.4 (10/2 nightly) and noticed that the format is
 different.

 Michael
 --
 View this message in context:
 http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 -- 
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com
 
 

-- 
View this message in context: 
http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25783768.html
Sent from the Solr - User mailing list archive at Nabble.com.



ApacheCon US

2009-10-07 Thread Grant Ingersoll
Just a friendly reminder to all about Lucene ecosystem events at
ApacheCon US this year.  We have two days of talks on pretty much
every project under Lucene (see http://lucene.apache.org/#14+August+2009+-+Lucene+at+US+ApacheCon
) plus a meetup, a two-day training on Lucene, and a one-day training
on Solr.  The Lucene training will cover Lucene 2.9 and I'm sure
Erik's Solr one will cover Solr 1.4.  I also know there will be quite
a few Lucene et al. committers at ApacheCon this year, so it should
be a good year to interact and discuss your favorite projects.


ApacheCon US is in Oakland (near San Francisco) the week of November  
2nd.  The trainings are on the 2nd and 3rd, and the main conference  
starts on the 4th.


You can register at http://www.us.apachecon.com/c/acus2009/

Hope to see you there,
Grant


Re: ISOLatin1AccentFilter before or after Snowball?

2009-10-07 Thread Chantal Ackermann

 See http://markmail.org/message/hi25u5iqusfu542b

Thank you for the link, Shalin!
It could be worth copying that to the wiki?

Cheers!
Chantal



I'd just like to make sure this is so. If it is the case, I'm wondering
whether snowball filters for, e.g., French process accented language
correctly at all, or whether they remove accents anyway... Or whether
accents should be removed whenever making use of snowball filters.



I'd think so but I'm not sure. Perhaps someone else can weigh in.



And also: it really is meant to take UTF-8 as input, even though it is
named ISOLatin1AccentFilter, isn't it?



See http://markmail.org/message/hi25u5iqusfu542b

--
Regards,
Shalin Shekhar Mangar.


Re: Solr Quries

2009-10-07 Thread Sandeep Tagore

Hi Pravin,

1. Does Solr work in a distributed environment? If yes, how do I configure it?
Yep. You can achieve this with sharding.
For example: install and configure Solr on two machines and declare any one
of those as master. Insert shard parameters while you index and search your
data.

2. Does Solr have Hadoop support? If yes, how do I set it up with Hadoop/HDFS?
(Note: I am familiar with Hadoop)
Sorry. No idea.

3. I have employee information (id, name, address, cell no, personal info) of
1 TB. To post (index) this data to the Solr server, do I have to create an
XML file with this data and then post it to the Solr server? Or is there any
other optimal way? In future my data will grow up to 10 TB; how can I index
this data then? (Creating XML is more of a headache.)
I think XML is not the best way; I don't suggest it. If you have that 1 TB of
data in a database, you can achieve this simply using the full-import
command. Configure your DB details in solrconfig.xml and data-config.xml and
add your DB driver jar to the Solr lib directory. Then import the data in
slices (say, department-wise or category-wise). In future, you can import the
data from a DB, or you can index the data directly using the client API with
simple Java beans.
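
For example, a full import is typically triggered with a URL like:

http://localhost:8983/solr/dataimport?command=full-import

and you can poll http://localhost:8983/solr/dataimport (with no command) to
watch its status.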

Hope this info helps you.

Regards,
Sandeep Tagore
-- 
View this message in context: 
http://www.nabble.com/Solr-Quries-tp25780371p25783891.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Doing SpellCheck in distributed search

2009-10-07 Thread balaji.a

Thanks Shalin! I applied your patch and deployed the war. While debugging,
the overridden method SpellCheckComponent.finishStage is not getting invoked
by the SearchHandler. Instead it's invoking the SearchComponent.finishStage
method. Do I need to configure anything extra to make it work? My current
configuration is as follows:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textSpell</str>

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">spell</str>
      <!-- Use a different Distance Measure -->
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <str name="spellcheckIndexDir">./spellchecker2</str>
    </lst>
  </searchComponent>

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <!--
      <int name="rows">10</int>
      <str name="fl">*</str>
      <str name="version">2.1</str>
      -->
      <!-- omp = Only More Popular -->
      <str name="spellcheck.onlyMorePopular">false</str>
      <!-- exr = Extended Results -->
      <str name="spellcheck.extendedResults">false</str>
      <!-- The number of suggestions to return -->
      <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>



Shalin Shekhar Mangar wrote:
 
 On Wed, Oct 7, 2009 at 2:14 PM, balaji.a reachbalaj...@gmail.com wrote:
 

 Hi All,
   I am trying to get spell check suggestions in my distributed search
 query
 using shards.
 
 
 SpellCheckComponent does not support distributed search yet. There is an
 issue open with a patch. If you decide to use, do let us know your
 feedback:
 
 https://issues.apache.org/jira/browse/SOLR-785
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Doing-SpellCheck-in-distributed-search-tp25782755p25783896.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 3:53 PM, Mint Ekalak mint@gmail.com wrote:


 I ran Solr successfully until I updated recently; now the import dies at
 this line from data-import.xml:
 ImportTime > '${dataimporter.last_index_time}'

 I got this error:

 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 execute query: select * from newheader where ImportTime > 'Wed Oct 07


Thanks for reporting the error. This seems to be a bug. I've opened an
issue:

https://issues.apache.org/jira/browse/SOLR-1496

-- 
Regards,
Shalin Shekhar Mangar.


Re: Indexing and searching of sharded/ partitioned databases and tables

2009-10-07 Thread Sandeep Tagore

Hi Jayant,
You can use Solr to achieve your objective.
The data-config.xml which you posted is incomplete.

I would like to suggest a way to index the full data.
Try to index one database at a time. Sample XML conf:

<dataSource type="JdbcDataSource" name="ds1" driver="com.mysql.jdbc.Driver"
 url="jdbc:mysql://localhost/Db1" user="user-name" password="password" />
<document name="Tbl1">
  <entity name="Tbl1" query="select id,name,category from Tbl1">
    <field column="id" name="id" />
    <field column="name" name="name" />
    <field column="category" name="category" />
  </entity>
</document>
<document name="Tbl2">
  <entity name="Tbl2" query="select id,name,category from Tbl2">
    <field column="id" name="id" />
    <field column="name" name="name" />
    <field column="category" name="category" />
  </entity>
</document>
<document name="Tbl3">
  <entity name="Tbl3" query="select id,name,category from Tbl3">
    <field column="id" name="id" />
    <field column="name" name="name" />
    <field column="category" name="category" />
  </entity>
</document>

You can write an automated program which changes the DB conf details in that
XML and fires the full-import command. You can use the
http://localhost:8983/solr/dataimport URL to check the status of the data
import.

But be careful while declaring the uniqueKey field. Make sure that you are
not overwriting records.
And if you are working on large data sets, you can use Solr's sharding
concept.

Let us know if you have any issues.

Regards,
Sandeep Tagore
-- 
View this message in context: 
http://www.nabble.com/Indexing-and-searching-of-sharded--partitioned-databases-and-tables-tp25782544p25783916.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Doing SpellCheck in distributed search

2009-10-07 Thread balaji.a

Sorry! It was my mistake of not copying the war to the correct location.


balaji.a wrote:
 
 Thanks Shalin! I applied your patch and deployed the war. While debugging,
 the overridden method SpellCheckComponent.finishStage is not getting
 invoked by the SearchHandler. Instead it's invoking the
 SearchComponent.finishStage method. Do I need to configure anything extra
 to make it work? My current configuration is as follows:

   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

     <str name="queryAnalyzerFieldType">textSpell</str>

     <lst name="spellchecker">
       <str name="name">default</str>
       <str name="field">spell</str>
       <str name="spellcheckIndexDir">./spellchecker1</str>
     </lst>
     <lst name="spellchecker">
       <str name="name">jarowinkler</str>
       <str name="field">spell</str>
       <!-- Use a different Distance Measure -->
       <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
       <str name="spellcheckIndexDir">./spellchecker2</str>
     </lst>
   </searchComponent>

   <requestHandler name="standard" class="solr.SearchHandler" default="true">
     <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <!--
       <int name="rows">10</int>
       <str name="fl">*</str>
       <str name="version">2.1</str>
       -->
       <!-- omp = Only More Popular -->
       <str name="spellcheck.onlyMorePopular">false</str>
       <!-- exr = Extended Results -->
       <str name="spellcheck.extendedResults">false</str>
       <!-- The number of suggestions to return -->
       <str name="spellcheck.count">1</str>
     </lst>
     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
   </requestHandler>
 
 
 
 Shalin Shekhar Mangar wrote:
 
 On Wed, Oct 7, 2009 at 2:14 PM, balaji.a reachbalaj...@gmail.com wrote:
 

 Hi All,
   I am trying to get spell check suggestions in my distributed search
 query
 using shards.
 
 
 SpellCheckComponent does not support distributed search yet. There is an
 issue open with a patch. If you decide to use, do let us know your
 feedback:
 
 https://issues.apache.org/jira/browse/SOLR-785
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Doing-SpellCheck-in-distributed-search-tp25782755p25783922.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Indexing and searching of sharded/ partitioned databases and tables

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore sandeep.tag...@gmail.com wrote:


 Hi Jayant,
 You can use Solr to achieve your objective.
 The data-config.xml which you posted is incomplete.


Sandeep, the data-config that Jayant posted is not incomplete. The field
declarations are not necessary if the name of the column in the database and
the field name in schema.xml are the same.


 I would like to suggest a way to index the full data.
 Try to index one database at a time. Sample XML conf:

 <dataSource type="JdbcDataSource" name="ds1" driver="com.mysql.jdbc.Driver"
  url="jdbc:mysql://localhost/Db1" user="user-name" password="password" />
 <document name="Tbl1">
   <entity name="Tbl1" query="select id,name,category from Tbl1">
     <field column="id" name="id" />
     <field column="name" name="name" />
     <field column="category" name="category" />
   </entity>
 </document>
 <document name="Tbl2">
   <entity name="Tbl2" query="select id,name,category from Tbl2">
     <field column="id" name="id" />
     <field column="name" name="name" />
     <field column="category" name="category" />
   </entity>
 </document>
 <document name="Tbl3">
   <entity name="Tbl3" query="select id,name,category from Tbl3">
     <field column="id" name="id" />
     <field column="name" name="name" />
     <field column="category" name="category" />
   </entity>
 </document>

 You can write an automated program which changes the DB conf details in
 that XML and fires the full-import command. You can use the
 http://localhost:8983/solr/dataimport URL to check the status of the data
 import.


You could do that, but I don't think it is required. If you do want to do
this, it is possible to post the data-config.xml to /dataimport (this is how
dataimport.jsp works).


 But be careful while declaring the uniqueKey field. Make sure that you
 are not overwriting records.


Yes, good point. That is a typical problem with sharded databases with
auto-increment primary key. If you do not have unique keys, you can
concatenate the shard name with the value of the primary key.
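
For example, a rough DIH sketch using the TemplateTransformer (the prefix is
illustrative):

<entity name="Tbl1" dataSource="ds1" transformer="TemplateTransformer"
        query="select id,name,category from Tbl1">
  <field column="id" template="db1_tbl1_${Tbl1.id}" />
</entity>

This keeps the uniqueKey distinct across shards.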

-- 
Regards,
Shalin Shekhar Mangar.


Re: Indexing and searching of sharded/ partitioned databases and tables

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore sandeep.tag...@gmail.com wrote:


 You can write an automated program which changes the DB conf details in
 that XML and fires the full-import command. You can use the
 http://localhost:8983/solr/dataimport URL to check the status of the data
 import.


Also note that full-import deletes all existing documents. So if you write
such a program which changes DB conf details, make sure you invoke the
import command (new in Solr 1.4) to avoid deleting the other documents.
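
Alternatively, a full-import with clean=false should also leave existing
documents alone, e.g. something like:

http://localhost:8983/solr/dataimport?command=full-import&clean=false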

-- 
Regards,
Shalin Shekhar Mangar.


Re : Re : Questions about synonyms and highlighting

2009-10-07 Thread Nourredine K.
Thanks Avlesh.

Now I understand better how highlighting works.

As you've said, since it is based on the analysers, highlighting will handle
things the same way as search.

A clarification about the #3 and #4 examples: they are exclusive. I wanted to
know how to do highlighting with stemming OR without it (not both at the same
time).

So I think you've answered #3 too :) It all depends on your analysers. And
for my case, the ISOLatin1AccentFilterFactory could do the job.

Thanks again Shalin and Avlesh.

Regards,

Nourredine.


 There is no lemmatisation support in Solr as of now. The only support you
 get is stemming.
 Let me understand this correctly - you basically want the searches to
 happen on the stemmed base but want to selectively highlight the original
 and/or stemmed words. Right? If yes, then AFAIK, this is not possible.
 Search passes through your field's analyzers (tokenizers and filters).
 Highlighters, typically, use the same set of analyzers, and the behavior
 will be the same as in search; this essentially means that the keywords
 manage, managing, management and manager are REDUCED to manage for
 searching and highlighting.
 If this can be done, then the only place to enable your feature would be
 the Lucene highlighter APIs. Someone more knowledgeable can tell you if
 that is possible.

 I have no idea about your #3, though my way of handling accentuation is to
 apply an ISOLatin1AccentFilterFactory and get rid of accents altogether :)
 I am curious to know the answer, though.


Re: datadir configuration

2009-10-07 Thread clico

I tried this in my context.xml.
It doesn't work:

<Environment
    name="solr/home"
    type="java.lang.String"
    value="D:\workspace\solr\home"
    override="true" />
<Environment
    name="solr.data.dir"
    type="java.lang.String"
    value="D:\workspace\solr\datas"
    override="true" />

-- 
View this message in context: 
http://www.nabble.com/datadir-configuration-tp25782469p25783937.html
Sent from the Solr - User mailing list archive at Nabble.com.



manage rights

2009-10-07 Thread clico

Hi everybody,
As I'm ready to deploy my Solr server (after many tests and use cases),
I'd like to configure my server so that some requests cannot be posted.

As an example:
My CMS or data app can use
- dataimport
- and other indexing commands

My website can only perform a search on the server.

Could someone explain to me where this configuration has to be done?

Thanks
-- 
View this message in context: 
http://www.nabble.com/manage-rights-tp25784152p25784152.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr optimize - no space left on device

2009-10-07 Thread Phillip Farber

All,

We're puzzled why we're still unable to optimize a 192GB index on an LVM
volume that has 406GB available. We are not using Solr distribution
(replication). There is no snapshooter in the picture. We run out of disk
capacity with df showing 100% but du showing just 379GB of files.


Restarting Tomcat causes space to be recovered and many segments to be
deleted, leaving just 3 from the original 33. Issuing another optimize at
that point causes Solr to run for a while and then show no further
activity (CPU, memory consumption) in jconsole. The 3 segments do not
merge into one.


% df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/internal-solr--build--2
                      406G  402G   30M 100% /l/solrs/build-2

Also suspicious is the 406G vs. 402G vs. 30M for size vs. used vs. avail.

Yesterday, after our 2nd try, lsof showed several deleted files that were
still open and apparently consuming almost 134GB of space.


jsvc  8381 tomcat  377u  REG  253,6   13369098240  1982471 /l/solrs/build-2/data/index/_1j37.tis (deleted)
jsvc  8381 tomcat  378u  REG  253,6     184778752  1982472 /l/solrs/build-2/data/index/_1j37.tii (deleted)
jsvc  8381 tomcat  379u  REG  253,6   34053685248  1982473 /l/solrs/build-2/data/index/_1j37.frq (deleted)
jsvc  8381 tomcat  380u  REG  253,6  130411978752  1982474 /l/solrs/build-2/data/index/_1j37.prx (deleted)


That theory did not work, because the error log showed that Solr was
trying to merge into the _1j37 segment files (shown as deleted in the
lsof above) when it ran out of space, so those are a symptom, not a
cause, of the lost space:


SEVERE: java.io.IOException: background merge hit exception:
_ojl:C151080 _169w:C141302 _1j36:C80405 _1j35:C2043 _1j34:C192 into
_1j37 [optimize]: java.io.IOException: background merge hit
exception: _ojl:C151080 _169w:C141302 _1j36:C80405 _1j35:C2043
_1j34:C192 into _1j37 [


We restored the pre-optimized index again, restarted Tomcat, and tried to
optimize using SerialMergePolicy instead of the default
ConcurrentMergePolicy, under the theory that concurrent merges could
somehow take more than 2X disk space.


The optimize failed again with an out-of-space error. This time there
were no deleted files in the lsof output.


This is one shard out of 10. A couple of the shards were around 192GB and
merged successfully. Any suggestions on how to debug this would be
greatly appreciated.


Thanks!

Phil
hathitrust.org
University of Michigan


Shalin Shekhar Mangar wrote:

Not sure but a quick search turned up:
http://www.walkernews.net/2007/07/13/df-and-du-command-show-different-used-disk-space/

Using up to 2x the index size can happen. Also check if there is a
snapshooter script running through cron that is making hard links to files
while a merge is in progress.

Do let us know if you make any progress. This is interesting.

On Tue, Oct 6, 2009 at 5:28 PM, Phillip Farber pfar...@umich.edu wrote:


I am attempting to optimize a large shard on Solr 1.4 and repeatedly get
java.io.IOException: No space left on device. The shard, after a final
commit before optimize, shows a size of about 192GB on a 400GB volume. I
have successfully optimized 2 other shards that were similarly large
without this problem on identical hardware boxes.

Before the optimize I see:

% df -B1 .
Filesystem 1B-blocks Used Available Use% Mounted on
/dev/mapper/internal-solr--build--2
435440427008 205681356800 225335255040 48%
/l/solrs/build-2

slurm-4:/l/solrs/build-2/data/index % du -B1
205441486848 .

There's a slight discrepancy between the du and df, which appears to be
orphaned inodes. But the du says there should be enough space to handle the
doubling in size during optimization. However, for the second time we ran
out of space, and at that point du and df are wildly different and the
volume is at 100%:


% df -B1 .

Filesystem   1B-blocks  Used Available Use% Mounted on
/dev/mapper/internal-solr--build--2
   435440427008 430985760768  30851072 100%
/l/solrs/build-2

slurm-4:/l/solrs/build-2/data/index % du -B1
252552298496 .

At this point it appears orphaned inodes are consuming space and not being
freed up. Any clue as to whether this is a Lucene bug, a Solr bug, or some
other problem? Error traces follow.

Thanks!

Phil

---

Oct 6, 2009 2:12:37 AM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 9110523
Oct 6, 2009 2:12:37 AM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _ojl:C151080
_169w:C141302 _1j36:C80405 _1j35:C2043 _1j34:C192 into _1j37 [optimize]
  at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2737)
  at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2658)
  at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:401)
  at

Re: Problems with DIH XPath flatten

2009-10-07 Thread Adam Foltzer
Here's a sample:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document [
<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
<!ENTITY reg "&#174;">
]>
<document>
  <kbml version="-//Indiana University//DTD KBML 0.9//EN">
    <kbq>In Mac OS X, how do I enable or disable the firewall?</kbq>
    <body>
    <p><kbh docid="aghe" access="allowed">Mac OS
X<domain>all</domain><visibility>visible</visibility></kbh> includes
an easy-to-use <kbh docid="aoru"
access="allowed">firewall<domain>all</domain><visibility>visible</visibility></kbh>
that can prevent potentially harmful incoming connections from other
computers. To turn it on or off:</p>

<h3>Mac OS X 10.6 (Snow Leopard)</h3>

<ol><li>From the Apple menu, select <mi>System Preferences...</mi>.
When the <code>System Preferences</code> window appears, from the
<mi>View</mi> menu, select <mi>Security</mi>.

<br clear="none"/><br clear="none"/>
</li><li>Click the <mi>Firewall</mi> tab.

...

</li></ol>
    </body>
    <xtra>
      <term weight="0">macos</term>
      <term weight="0">macintosh</term>
      <term weight="0">apple</term>
      <term weight="0">macosx</term>

...

    </xtra>
  </kbml>
  <metadata>
    <docid>aozg</docid>
    <owner firstname="" lastname="Macintosh Support">scmac</owner>

...

  </metadata>
</document>

The /document/kbml/kbq works fine, but as you can see, it has no
children. The actual content of the document is within the body
element, though, which requires some flattening.

Thanks for your time,
Adam

2009/10/6 Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com:
 send a small sample xml snippet you are trying to index and it may help

 On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer acfolt...@gmail.com wrote:
 Hi all,

 I'm trying to set up DataImportHandler to index some XML documents available
 over web services. The XML includes both content and metadata, so for the
 indexable content, I'm trying to just index everything under the content
 tag:

 <entity dataSource="kbws" name="kbxml" pk="title"
         url="resturl" processor="XPathEntityProcessor"
         forEach="/document" transformer="HTMLStripTransformer"
         flatten="true">
   <field column="content" name="content" xpath="/document/kbml/body"
          flatten="true" stripHTML="true" />
   <field column="title" name="title" xpath="/document/kbml/kbq" />
 </entity>

 The result of this is that the title field gets populated and indexed
 (there are no child nodes of /document/kbml/kbq), but content does not get
 indexed at all. Since /document/kbml/body has many children, I expected
 that flatten="true" would store all of the body text in the field.
 Instead, it stores nothing at all. I've tried this with many combinations
 of transformers and flatten options, and the result is the same each time.

 Here are the relevant field declarations from the schema (the type=text is
 just the one from the example's schema.xml). I have tried combinations here
 as well of stored= and multiValued=, with the same result each time.

 <field name="title" type="text" indexed="true" stored="true"
  multiValued="true" />
 <field name="content" type="text" indexed="true" stored="true"
  multiValued="true" />

 If it would help troubleshooting, I could send along some sample XML. I
 don't want to spam the list with an attachment unless it's necessary, though
 :)

 Thanks in advance for your help,

 Adam Foltzer




 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com



Re: Solr Trunk Heap Space Issues

2009-10-07 Thread Jeff Newburn
Here is what I discovered after dozens of reindexes.  We have a tool that is
pulling all of the documents' uniqueIds.  This tool is causing the cache to
fill up.  We turned it off and the system was able to reindex.

Here is what is still puzzling to me about this entire scenario.

When we had only 1 core active I was able to reindex the core even with the
tool filling up the document cache.  As soon as I added a second empty core
the OOM stuff started.  Could this be caused by the second core allowing the
document cache to leak into it?  It just seems strange that a second empty
core allows the system to run out of heap.

-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


 From: Mark Miller markrmil...@gmail.com
 Reply-To: solr-user@lucene.apache.org
 Date: Tue, 06 Oct 2009 17:21:47 -0400
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Trunk Heap Space Issues
 
 Mark Miller wrote:
 Jeff Newburn wrote:
   
 So could that potentially explain our use of more ram on indexing? Or is
 this a rare edge case.
   
 
 I think it could explain the JVM using more RAM while indexing - but it
 should be fairly easily recoverable from what I can tell - so no
 explanation on the OOM yet. Still looking at that one.
 
  Is your system basically stock, or do you have custom plugins in it?
 
   
 No matter what I try with however many cores, I can't duplicate your
 problem.
 
 -- 
 - Mark
 
 http://www.lucidimagination.com
 
 
 



Re: Seattle / PNW Hadoop/Lucene/HBase Meetup, Wed Sep 30th

2009-10-07 Thread Nick Dimiduk
Hey PNW Clouders! I'd really like to chat further with the crew doing
distributed Solr. Give me a ring or shoot me an email, let's do lunch!
-Nick

On Wed, Sep 30, 2009 at 2:10 PM, Nick Dimiduk ndimi...@gmail.com wrote:

 As Bradford is out of town this evening, I will take up the mantle of
 Person-on-Point. Contact me with questions re: tonight's gathering.

 See you tonight!

 -Nick
 614.657.0267


 On Mon, Sep 28, 2009 at 4:33 PM, Bradford Stephens 
 bradfordsteph...@gmail.com wrote:

 Hello everyone!
 Don't forget that the Meetup is THIS Wednesday! I'm looking forward to
 hearing about Hive from the Facebook team ... and there might be a few
 other
 interesting talks as well. Here's the details in the wiki:
 http://wiki.apache.org/hadoop/PNW_Hadoop_%2B_Apache_Cloud_Stack_User_Group

 Cheers,
 Bradford

 On Mon, Sep 14, 2009 at 11:35 AM, Bradford Stephens 
 bradfordsteph...@gmail.com wrote:

  Greetings,
 
  It's time for another Hadoop/Lucene/Apache Cloud Stack meetup!
  This month it'll be on Wednesday, the 30th, at 6:45 pm.
 
  We should have a few interesting guests this time around -- someone from
  Facebook may be stopping by to talk about Hive :)
 
  We've had great attendance in the past few months, let's keep it up! I'm
  always
  amazed by the things I learn from everyone.
 
  We're back at the University of Washington, Allen Computer Science
  Center (not Computer Engineering)
  Map: http://www.washington.edu/home/maps/?CSE
 
  Room: 303 -or- the Entry level. If there are changes, signs will be
 posted.
 
  More Info:
 
  The meetup is about 2 hours (and there's usually food): we'll have two
  in-depth talks of 15-20 minutes each, and then several lightning talks
  of 5 minutes. If no one offers to speak, we'll just have general
  discussion and 'social time'. Let me know if you're interested in
  speaking or attending. We'd like to focus on education, so every
  presentation *needs* to ask some questions at the end. We can talk
  about these after the presentations, and I'll record what we've learned
  in a wiki and share that with the rest of us.
 
  Contact: Bradford Stephens, 904-415-3009, bradfordsteph...@gmail.com
 
  Cheers,
  Bradford
  --
  http://www.roadtofailure.com -- The Fringes of Scalability, Social
  Media, and Computer Science
 



 --
 http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
 and Computer Science





Re: How to retrieve the index of a string within a field?

2009-10-07 Thread Elaine Li
Hi Sandeep,

Say the field is <field name="sentence">Can you get what you
want?</field>, and the field type is Text.

My query contains 'sentence:get what you'. Is it possible to get the
number 2 directly from a query, since the word 'get' is the 2nd token
in the sentence?

Thanks.

Elaine

On Wed, Oct 7, 2009 at 8:12 AM, Sandeep Tagore sandeep.tag...@gmail.com wrote:

 Hi Elaine,
 What do you mean by index of this word: do you want to return the first
 occurrence of the word in that sentence, or the document id?
 Also, which type of field is it? Is it a Text or a String? If it is of
 type Text, you can't achieve that because the sentence will be tokenized.

 Sandeep


 Elaine Li wrote:

 I have a field. The field has a sentence. If the user types in a word
 or a phrase, how can I return the index of this word or the index of
 the first word of the phrase?
 I tried to use bf=ord..., but it does not work as I expected.


 --
 View this message in context: 
 http://www.nabble.com/How-to-retrieve-the-index-of-a-string-within-a-field--tp25771821p25783936.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: How to retrieve the index of a string within a field?

2009-10-07 Thread Sandeep Tagore

Hi Elaine,
You can achieve that with some modifications in the Solr configuration files.
Generally text will be configured as:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

When a field is declared as text (with the above configuration) it will be
tokenized. Say, for example, your sentence "Can you get what you want?" will
be tokenized into can, you, get, what, you, want. So when you search for
'sentence:get what you' you will get 0 results.

To achieve your objective you can remove the tokenizers in the text configuration.
The best way, I suggest, is to declare the field as type string. Search the
string with a wildcard like 'sentence:*get what you*' using the SolrJ client,
and when you get the records (results), save the output of
sentence.indexOf(keyword) in your Java bean. Here sentence is a variable
declared in the Java bean.
For more details you need to read up on the usage of SolrJ. If you have any
issues in modifying the configuration, post the configuration you have for the
fieldtype text and I will modify it for you.
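A minimal SolrJ sketch of that client-side position lookup (hedged: the server
URL is hypothetical, the sentence field is assumed to be stored, and it queries
the tokenized field with a phrase rather than a wildcard):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PositionLookup {
  public static void main(String[] args) throws Exception {
    // Hypothetical URL; point this at your own Solr instance.
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    String keyword = "get what you";
    // Phrase query against the tokenized field; the position is then
    // computed client-side from the stored value with indexOf().
    QueryResponse rsp = server.query(new SolrQuery("sentence:\"" + keyword + "\""));

    for (SolrDocument doc : rsp.getResults()) {
      String sentence = (String) doc.getFieldValue("sentence");
      int charIdx = sentence.toLowerCase().indexOf(keyword);
      if (charIdx < 0) continue; // stored value may differ from the analyzed form
      // Token position = number of whitespace-separated words before the match.
      int tokenPos = (charIdx == 0)
          ? 0 : sentence.substring(0, charIdx).trim().split("\\s+").length;
      System.out.println(tokenPos + ": " + sentence);
    }
  }
}

For the example sentence "Can you get what you want?" this prints 2, the
position asked about above.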

Regards,
Sandeep Team


Elaine Li wrote:
 
 Say the field is <field name="sentence">Can you get what you want?</field>,
 and the field type is Text.
 
 My query contains 'sentence:get what you'. Is it possible to get
 number 2 directly from a query since the word 'get' is the 2nd token
 in the sentence?
 

-- 
View this message in context: 
http://www.nabble.com/How-to-retrieve-the-index-of-a-string-within-a-field--tp25771821p25788406.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-07 Thread Tricia Williams

Chris Hostetter wrote:

: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect no match unless the ambiguous date is
entirely contained within the range being queried on.
  
If implemented in DateField I guess this behaviour would need to be 
configurable.
(your implication of counting once per day would have pretty weird results 
on faceting by the way)
  
I agree.  It would be possible to have one document hit on a query but 
have hundreds of facet categories with a count of one under this 
scheme.  I'm leaning towards the scenario I described where the document 
would be counted once in another facet category if it is relevant 
through rounding.
with unambiguous dates, you can have exactly what you want just by being a 
little more verbose when indexing/querying (and someone else can have 
exactly what they want by being equally verbose using slightly different 
options/queries).


in your case: i would suggest that you use two fields: date_low and 
date_high ... when you have an exact date (down to the smallest level of 
granularity you care about) you put the same value in both fields, when 
you have an ambiguous value (like 2001-03) you put the largest value 
possible in date_high and the lowest value possible in date_low (ie: 
date_low:2001-03-01T00:00:00Z  date_high:2001-03-31T23:59:59.999Z) then a 
query for anything *overlapping* the range from feb28 to march 13 would 
be...


+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates.

(someone else who only wants to see matches if the ranges *completely* 
overlap would just swap which end point they queried against which field)
  
We've had a really similar solution in place for range queries for a 
while.  Our current problem is really faceting.


Thanks,
Tricia


how can I use debugQuery if I have extended QParserPlugin?

2009-10-07 Thread gdeconto

in a previous post, I asked how I would go about creating a virtual function in
my Solr query; ie: http://127.0.0.1:8994/solr/select...@myfunc(1,2,3,4)

I was trying to find a way to more easily/cleanly perform queries against
large numbers of dynamic fields (ie field1, field2, field3...field99).

I have extended QParserPlugin so that I can do this. The extended method
replaces the virtual-function section of the query with an expanded set of
fields; @myFunc(1,2,3,4) can become something like ((A1:1 AND B1:2 AND
C1:3 AND D1:4) OR (A2:1 AND B2:2 AND C2:3 AND D2:4) OR ... OR (A99:1 AND
B99:2 AND C99:3 AND D99:4))
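One way to sketch this (hedged: class name hypothetical, Solr 1.4 API; the
A1..D99 field naming follows the example above). Doing the expansion inside
createParser and then delegating to the stock Lucene parser means every
downstream consumer - including the debug component - sees the already-expanded
query string, which should sidestep the NullPointerException:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class MacroQParserPlugin extends QParserPlugin {
  // Matches @myFunc(1,2,3,4) and captures the argument list.
  private static final Pattern MACRO = Pattern.compile("@myFunc\\(([^)]*)\\)");

  public void init(NamedList args) {}

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    // Rewrite first, then hand the expanded string to the standard parser.
    String expanded = expand(qstr);
    return new LuceneQParserPlugin().createParser(expanded, localParams, params, req);
  }

  private String expand(String q) {
    Matcher m = MACRO.matcher(q);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String[] vals = m.group(1).split(",");
      StringBuilder or = new StringBuilder("(");
      for (int i = 1; i <= 99; i++) {            // dynamic fields 1..99
        if (i > 1) or.append(" OR ");
        or.append('(');
        for (int j = 0; j < vals.length; j++) {  // A1:1 AND B1:2 AND ...
          if (j > 0) or.append(" AND ");
          or.append((char) ('A' + j)).append(i).append(':').append(vals[j].trim());
        }
        or.append(')');
      }
      m.appendReplacement(out, Matcher.quoteReplacement(or.append(')').toString()));
    }
    m.appendTail(out);
    return out.toString();
  }
}

It would be registered in solrconfig.xml with
<queryParser name="macro" class="MacroQParserPlugin"/> and selected per request
with defType=macro.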

one thing I noticed is that if I append debugQuery to a query that includes
the virtual function, I get a NullPointerException, likely because the
debugging code looks at the query passed in and not the expanded list that
my code generates.

I would like to be able to use debugQuery to analyse my queries, including
those with the virtual function.

What would I have to modify to get debugQuery to work??

thx in advance.
-- 
View this message in context: 
http://www.nabble.com/how-can-I-use-debugQuery-if-I-have-extended-QParserPlugin--tp25789546p25789546.html
Sent from the Solr - User mailing list archive at Nabble.com.



IndexWriter InfoStream in solrconfig not working

2009-10-07 Thread Burton-West, Tom
Hello,

We are trying to debug an indexing/optimizing problem and have tried setting 
the infoStream file in solrconfig.xml so that the SolrIndexWriter will write a 
log file.  Here is our setting:

<!--
  To aid in advanced debugging, you may turn on IndexWriter debug logging.
  Uncommenting this and setting to true will set the file that the
  underlying Lucene IndexWriter will write its debug infostream to.
-->
<infoStream file="/tmp/LuceneIndexWriterDebug.log">true</infoStream>

After making that change to solrconfig.xml, restarting Solr, we see a message 
in the tomcat logs saying that the log is enabled:

build-2_log.2009-10-06.txt:INFO: IndexWriter infoStream debug log is enabled: 
/tmp/LuceneIndexWriterDebug.log

However, if we then run an optimize we can't see any log file being written.

I also looked at the patch for  http://issues.apache.org/jira/browse/SOLR-1145, 
but did not see a unit test that I might try to run in our system.


Do others have this logging working successfully ?
Is there something else that needs to be set up?

Tom



Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:

 In a separate thread, I've detailed how an optimize is taking > 2x disk
 space. We don't use solr distribution/snapshooter.  We are using the default
 deletion policy = 1. We can't optimize a 192G index in 400GB of space.

 This thread in lucene/java-user

 http://www.gossamer-threads.com/lists/lucene/java-user/43475

 suggests that an optimize should not take > 2x unless perhaps an IndexReader
 is holding on to segments. This could be our problem since when optimization
 runs out of space, if we stop tomcat, a number of files go away and space is
 recovered.

 But we are not searching the index so how could a Searcher/IndexReader have
 any segments open?

 I notice in the logs that as part of routine commits or as part of optimize
 a Searcher is registered and autowarmed from a previous searcher (of course
 there's nothing in the caches -- this is just a build machine).

 INFO: registering core:
 Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
 INFO: [] Registered new searcher searc...@2e097617 main

 Does this mean that there's always a Lucene IndexReader holding segment
 files open so they can't be deleted during an optimize, so we run out of disk
 space > 2x?

Yes.
A feature could probably be developed now that avoids opening a
reader until it's requested.
That wasn't really possible in the past - due to many issues such as
Lucene autocommit.

-Yonik
http://www.lucidimagination.com


Re: TermsComponent or auto-suggest with filter

2009-10-07 Thread Jay Hill
Something like this, building on each character typed:

facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1
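Spelled out as a full request (an assumption: tc_query is the field from the
blog post, and "be" stands in for the characters typed so far):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1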

-Jay
http://www.lucidimagination.com


On Tue, Oct 6, 2009 at 5:43 PM, R. Tan tanrihae...@gmail.com wrote:

 Nice. In comparison, how do you do it with faceting?

  Two other approaches are to use either the TermsComponent (new in Solr
  1.4) or faceting.



 On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill jayallenh...@gmail.com wrote:

  Have a look at a blog I posted on how to use EdgeNGrams to build an
  auto-suggest tool:
 
 
 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
 
  You could easily add filter queries to this approach. For example, the
  query used in the blog could add filter queries like this:
 
  http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on
  &echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery
 
  -Jay
  http://www.lucidimagination.com
 
 
 
 
  On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote:
 
   Hello,
   What's the best way to get auto-suggested terms/keywords that is
 filtered
   by
   one or more fields? TermsComponent should have been the solution but
   filters
   are not supported.
  
   Thanks,
   Rihaed
  
 



Facet query pb

2009-10-07 Thread clico

Hello
I have a pb trying to retrieve a tree with facet use

I 've got a field location_field
Each doc in my index has a location_field

Location field can be
continent/country/city


I have 2 queries:

http://server/solr//select?fq=(location_field:NORTH*) : ok, retrieve docs

http://server/solr//select?fq=(location_field:NORTH AMERICA*) : not ok


I think with NORTH AMERICA I have a problem with the space character.

Could you help me?



-- 
View this message in context: 
http://www.nabble.com/Facet-query-pb-tp25790667p25790667.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How much disk space does optimize really take

2009-10-07 Thread Jason Rutherglen
It would be good to be able to commit without opening a new
reader however with Lucene 2.9 the segment readers for all
available segments are already created and available via
getReader which manages the reference counting internally.

Using reopen redundantly creates SRs that are already held
internally in IW.

On Wed, Oct 7, 2009 at 9:59 AM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:

 In a separate thread, I've detailed how an optimize is taking > 2x disk
 space. We don't use solr distribution/snapshooter.  We are using the default
 deletion policy = 1. We can't optimize a 192G index in 400GB of space.

 This thread in lucene/java-user

 http://www.gossamer-threads.com/lists/lucene/java-user/43475

 suggests that an optimize should not take > 2x unless perhaps an IndexReader
 is holding on to segments. This could be our problem since when optimization
 runs out of space, if we stop tomcat, a number of files go away and space is
 recovered.

 But we are not searching the index so how could a Searcher/IndexReader have
 any segments open?

 I notice in the logs that as part of routine commits or as part of optimize
 a Searcher is registered and autowarmed from a previous searcher (of course
 there's nothing in the caches -- this is just a build machine).

 INFO: registering core:
 Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
 INFO: [] Registered new searcher searc...@2e097617 main

 Does this mean that there's always a Lucene IndexReader holding segment
 files open so they can't be deleted during an optimize, so we run out of disk
 space > 2x?

 Yes.
 A feature could probably be developed now that avoids opening a
 reader until it's requested.
 That wasn't really possible in the past - due to many issues such as
 Lucene autocommit.

 -Yonik
 http://www.lucidimagination.com



Re: Facet query pb

2009-10-07 Thread Avlesh Singh
I have no idea what pb means, but this is what you probably want -
fq=(location_field:(NORTH AMERICA*))

Cheers
Avlesh

On Wed, Oct 7, 2009 at 10:40 PM, clico cl...@mairie-marseille.fr wrote:


 Hello
 I have a pb trying to retrieve a tree with facet use

 I 've got a field location_field
 Each doc in my index has a location_field

 Location field can be
 continent/country/city


 I have 2 queries:

 http://server/solr//select?fq=(location_field:NORTH*) : ok, retrieve docs

 http://server/solr//select?fq=(location_field:NORTH AMERICA*) : not ok


 I think with NORTH AMERICA I have a problem with the space character.

 Could you help me?



 --
 View this message in context:
 http://www.nabble.com/Facet-query-pb-tp25790667p25790667.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Facet query pb

2009-10-07 Thread Christian Zambrano

Clico,

Because you are doing a wildcard query, the token 'AMERICA' will not be 
analyzed at all. This means that 'AMERICA*' will NOT match 'america'.
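
One possible workaround, assuming location_field is an untokenized string type:
escape the space so the query parser keeps the whole prefix as a single term, e.g.

fq=location_field:NORTH\ AMERICA*

(against an analyzed field, a wildcard term is not analyzed, so the literal term
must match the indexed form, e.g. lowercase).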


On 10/07/2009 12:30 PM, Avlesh Singh wrote:

I have no idea what pb mean but this is what you probably want -
fq=(location_field:(NORTH AMERICA*))

Cheers
Avlesh

On Wed, Oct 7, 2009 at 10:40 PM, clicocl...@mairie-marseille.fr  wrote:

   

Hello
I have a pb trying to retrieve a tree with facet use

I 've got a field location_field
Each doc in my index has a location_field

Location field can be
continent/country/city


I have 2 queries:

http://server/solr//select?fq=(location_field:NORTH*) : ok, retrieve docs

http://server/solr//select?fq=(location_field:NORTH AMERICA*) : not ok


I think with NORTH AMERICA I have a problem with the space character.

Could you help me?



--
View this message in context:
http://www.nabble.com/Facet-query-pb-tp25790667p25790667.html
Sent from the Solr - User mailing list archive at Nabble.com.


 
   


Re: How much disk space does optimize really take

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 It would be good to be able to commit without opening a new
 reader however with Lucene 2.9 the segment readers for all
 available segments are already created and available via
 getReader which manages the reference counting internally.

 Using reopen redundantly creates SRs that are already held
 internally in IW.


Jason, I think this is something we should consider changing. A user who is
not using NRT features should not pay the price of keeping readers opened.
We are also interested in opening a searcher just-in-time for SOLR-1293. We
have use-cases where a SolrCore is loaded only for indexing and then
unloaded.

-- 
Regards,
Shalin Shekhar Mangar.


Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
I think that argument requires auto commit to be on and opening readers
after the optimize starts? Otherwise, the optimized version is not put
into place until a commit is called, and a Reader won't see the newly
merged segments until then - so the original index is kept around in
either case - having a Reader open on it shouldn't affect the space
requirements?

Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:
   
 In a separate thread, I've detailed how an optimize is taking > 2x disk
 space. We don't use solr distribution/snapshooter.  We are using the default
 deletion policy = 1. We can't optimize a 192G index in 400GB of space.

 This thread in lucene/java-user

 http://www.gossamer-threads.com/lists/lucene/java-user/43475

 suggests that an optimize should not take > 2x unless perhaps an IndexReader
 is holding on to segments. This could be our problem since when optimization
 runs out of space, if we stop tomcat, a number of files go away and space is
 recovered.

 But we are not searching the index so how could a Searcher/IndexReader have
 any segments open?

 I notice in the logs that as part of routine commits or as part of optimize
 a Searcher is registered and autowarmed from a previous searcher (of course
 there's nothing in the caches -- this is just a build machine).

 INFO: registering core:
 Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
 INFO: [] Registered new searcher searc...@2e097617 main

 Does this mean that there's always a Lucene IndexReader holding segment
 files open so they can't be deleted during an optimize, so we run out of disk
 space > 2x?
 

 Yes.
 A feature could probably be developed now that avoids opening a
 reader until it's requested.
 That wasn't really possible in the past - due to many issues such as
 Lucene autocommit.

 -Yonik
 http://www.lucidimagination.com
   


-- 
- Mark

http://www.lucidimagination.com





RE: IndexWriter InfoStream in solrconfig not working

2009-10-07 Thread Giovanni Fernandez-Kincade
I had the same problem. I'd be very interested to know how to get this 
working...

-Gio.

-Original Message-
From: Burton-West, Tom [mailto:tburt...@umich.edu] 
Sent: Wednesday, October 07, 2009 12:13 PM
To: solr-user@lucene.apache.org
Subject: IndexWriter InfoStream in solrconfig not working

Hello,

We are trying to debug an indexing/optimizing problem and have tried setting 
the infoStream file in solrconfig.xml so that the SolrIndexWriter will write a 
log file.  Here is our setting:

<!--
  To aid in advanced debugging, you may turn on IndexWriter debug logging.
  Uncommenting this and setting to true will set the file that the
  underlying Lucene IndexWriter will write its debug infostream to.
-->
<infoStream file="/tmp/LuceneIndexWriterDebug.log">true</infoStream>

After making that change to solrconfig.xml, restarting Solr, we see a message 
in the tomcat logs saying that the log is enabled:

build-2_log.2009-10-06.txt:INFO: IndexWriter infoStream debug log is enabled: 
/tmp/LuceneIndexWriterDebug.log

However, if we then run an optimize we can't see any log file being written.

I also looked at the patch for  http://issues.apache.org/jira/browse/SOLR-1145, 
but did not see a unit test that I might try to run in our system.


Do others have this logging working successfully ?
Is there something else that needs to be set up?

Tom



Default query parameter for one core

2009-10-07 Thread Michael
I'd like to have 5 cores on my box.  core0 should automatically shard to
cores 1-4, which each have a quarter of my corpus.
I tried this in my solrconfig.xml:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="shards">${solr.core.shardsParam:}</str> <!-- aka, if the
           core specifies a shardsParam, great, and if not, use nothing -->
    </lst>
  </requestHandler>

and this in my solr.xml:

<cores adminPath="/admin/cores" shareSchema="true">
  <core name="core0" instanceDir="./"
        shardsParam="localhost:9990/core1,localhost:9990/core2,localhost:9990/core3,localhost:9990/core4"/>
  <core name="core1" instanceDir="./" dataDir="/home/search/data/1"/>
  <!-- etc. for cores 2 through 4 -->
</cores>

Unfortunately, this doesn't work, because cores 1 through 4 end up
specifying a blank shards param, which is different from no shards param at
all -- it results in a NullPointerException.

Is there a way to not have the shards param at all for most cores, and for
core0 to specify it?
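
One workaround worth sketching (an assumption, not a tested recipe: it leans on
the per-core config attribute in solr.xml, so only core0 loads a solrconfig
that contains a shards default at all; the file names are hypothetical):

<cores adminPath="/admin/cores" shareSchema="true">
  <!-- core0 loads a config whose standard handler hard-codes the shards default -->
  <core name="core0" instanceDir="./" config="solrconfig-distrib.xml"/>
  <!-- cores 1-4 load a config with no shards entry at all, so no empty-param NPE -->
  <core name="core1" instanceDir="./" config="solrconfig.xml" dataDir="/home/search/data/1"/>
  <!-- etc. for cores 2 through 4 -->
</cores>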


Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber



Yonik Seeley wrote:



Does this mean that there's always a Lucene IndexReader holding segment
files open so they can't be deleted during an optimize, so we run out of disk
space > 2x?


Yes.
A feature could probably be developed now that avoids opening a
reader until it's requested.
That wasn't really possible in the past - due to many issues such as
Lucene autocommit.



So this implies that for a normal optimize, in every case, due to the 
Searcher holding open the existing segments prior to optimize, we'd 
always need 3x even in the normal case.


This seems wrong, since it is repeatedly stated that in the normal case 
only 2x is needed, and I have successfully optimized a similarly sized 192G 
index on identical hardware with a 400G capacity.


Yonik, I'm uncertain then about what you're saying about the required disk 
space for optimize.  Could you clarify?





-Yonik
http://www.lucidimagination.com


Re: Facet query pb

2009-10-07 Thread Todd Benge
Aq

On 10/7/09, clico cl...@mairie-marseille.fr wrote:

 Hello
 I have a pb trying to retrieve a tree with facet use

 I 've got a field location_field
 Each doc in my index has a location_field

 Location field can be
 continent/country/city


 I have 2 queries:

 http://server/solr//select?fq=(location_field:NORTH*) : ok, retrieve docs

 http://server/solr//select?fq=(location_field:NORTH AMERICA*) : not ok


 I think with NORTH AMERICA I have a pb with the space caractere

 Could u help me



 --
 View this message in context:
 http://www.nabble.com/Facet-query-pb-tp25790667p25790667.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Sent from my mobile device


Re: How much disk space does optimize really take

2009-10-07 Thread Jason Rutherglen
To be clear, the SRs created by merges don't have the term index
loaded, which is the main cost.  One would need to use
IndexReaderWarmer to load the term index before the new SR becomes
part of SegmentInfos.

On Wed, Oct 7, 2009 at 10:34 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 It would be good to be able to commit without opening a new
 reader however with Lucene 2.9 the segment readers for all
 available segments are already created and available via
 getReader which manages the reference counting internally.

 Using reopen redundantly creates SRs that are already held
 internally in IW.


 Jason, I think this is something we should consider changing. A user who is
 not using NRT features should not pay the price of keeping readers opened.
 We are also interested in opening a searcher just-in-time for SOLR-1293. We
 have use-cases where a SolrCore is loaded only for indexing and then
 unloaded.

 --
 Regards,
 Shalin Shekhar Mangar.



Re: How to retrieve the index of a string within a field?

2009-10-07 Thread Elaine Li
Sandeep, I do get results when I search for 'sentence:get what you', not 0 results.

What in my schema makes this difference?

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
        synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.EnglishPorterFilterFactory"
        protected="protwords.txt"/> -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
        synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/> -->
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.EnglishPorterFilterFactory"
        protected="protwords.txt"/> -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I need to learn SolrJ. I am currently using JavaScript as a client and
invoke HTTP calls to get results to display in the browser. Can SolrJ
get all the results in one shot without the HTTP call? I need to do some
postprocessing against all the results and then display the processed
data. Submitting multiple HTTP queries and post-processing after each
query does not seem to be the right way.
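
A hedged aside on that SolrJ point: SolrJ still issues HTTP requests under the
hood unless you embed Solr, but it does return everything typed in one call. A
minimal sketch, with a hypothetical URL and query:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr"); // hypothetical
    SolrQuery q = new SolrQuery("sentence:\"get what you\"");
    q.setRows(1000); // one large page instead of many small requests
    SolrDocumentList docs = server.query(q).getResults();
    System.out.println("numFound=" + docs.getNumFound());
    for (SolrDocument doc : docs) {
      // post-process each result here before handing it to the UI
      System.out.println(doc.getFieldValue("sentence"));
    }
  }
}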

Thanks.

Elaine

On Wed, Oct 7, 2009 at 11:06 AM, Sandeep Tagore
sandeep.tag...@gmail.com wrote:

 Hi Elaine,
 You can achieve that with some modifications to the Solr configuration files.
 Generally, the text field type is configured as:
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When a field is declared as text (with the above configuration) it will be
 tokenized. Say, for example, your sentence "Can you get what you want?" will
 be tokenized into can, you, get, what, you, want. So when you search for
 'sentence:get what you' you will get 0 results.

 To achieve your objective you can remove the tokenizers in the text configuration.
 The best way, I suggest, is to declare the field as type string. Search the
 string with a wildcard like 'sentence:*get what you*' using the SolrJ client,
 and when you get the records (results), save the output of
 sentence.indexOf(keyword) in your Java bean. Here sentence is a variable
 declared in the Java bean.
 For more details you need to read up on the usage of SolrJ. If you have any issues
 in modifying the configuration, post the configuration you have for the
 fieldtype text and I will modify it for you.

 Regards,
 Sandeep Team


 Elaine Li wrote:

 Say the field is <field name="sentence">Can you get what you want?</field>,
 and the field type is Text.

 My query contains 'sentence:get what you'. Is it possible to get
 number 2 directly from a query since the word 'get' is the 2nd token
 in the sentence?


 --
 View this message in context: 
 http://www.nabble.com/How-to-retrieve-the-index-of-a-string-within-a-field--tp25771821p25788406.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-07 Thread Prasanna Ranganathan


On 10/6/09 3:32 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 :  I ll try to explain with an example. Given the term 'it!' in the title, it
 : should match both 'it' and 'it!' in the query as an exact match. Currently,
 : this is done by using a synonym entry  (and index time SynonymFilter) as
 : follows:
 : 
 :  it! = it, it!
 : 
 :  Now, the above holds true for all cases where you have a title token of the
 : form [aA-zZ]*!. Handling all of those cases requires adding synonyms
 : manually for each case which is not easy to manage and does not scale.
 : 
 :  I am hoping to do the same by using a index time filter that takes in a
 : pattern like the PatternReplace filter and adds the newly created token
 : instead of replacing the original one. Does this make sense? Am I missing
 : something that would break this approach?
 
 something like this would be fairly easy to implement in Lucene, but 
 somewhat confusing to try and configure in Solr.  I was going to suggest 
 that you use something like...
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="(^.*)(\!?)$" replacement="$1 $2" replace="all" />
 
 ..and then have a subsequent filter that splits the tokens on the 
 whitespace (or any other special character you could use in the 
 replacement) ... but apparently we don't have any built-in filters that 
 will just split tokens on a character/pattern for you.  That would also be 
 fairly easy to write if someone wants to submit a patch.

 There is a solr.PatternTokenizerFactory class which likely fits the bill in
this case. The related question I have is this - is it possible to have
multiple tokenizers in your analysis chain?

Prasanna.
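
For what it's worth, an analyzer chain allows exactly one tokenizer, so a second
tokenizer can't be appended; a sketch of how the pieces above might combine
instead (an untested assumption; the type name and pattern are illustrative):

<fieldType name="text_bang" class="solr.TextField">
  <analyzer type="index">
    <!-- split the raw value on whitespace up front -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+"/>
    <!-- strip a trailing bang so "it!" also matches queries for "it";
         emitting both "it" and "it!" would still need the missing
         split filter or a custom token filter -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(.*)!$" replacement="$1" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>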



How to determine the size of the index?

2009-10-07 Thread Fishman, Vladimir

 Is this info available via the admin page?


Re: Facet query pb

2009-10-07 Thread clico

That's not the problem.
I want to use that in order to drill down a tree.


Christian Zambrano wrote:
 
 Clico,
 
 Because you are doing a wildcard query, the token 'AMERICA' will not be 
 analyzed at all. This means that 'AMERICA*' will NOT match 'america'.
 
 On 10/07/2009 12:30 PM, Avlesh Singh wrote:
 I have no idea what pb mean but this is what you probably want -
 fq=(location_field:(NORTH AMERICA*))

 Cheers
 Avlesh

 On Wed, Oct 7, 2009 at 10:40 PM, clicocl...@mairie-marseille.fr  wrote:


 Hello
 I have a pb trying to retrieve a tree with facet use

 I 've got a field location_field
 Each doc in my index has a location_field

 Location field can be
 continent/country/city


 I have 2 queries:

 http://server/solr//select?fq=(location_field:NORTH*) : ok, retrieve docs

 http://server/solr//select?fq=(location_field:NORTH AMERICA*) : not ok


 I think with NORTH AMERICA I have a problem with the space character.

 Could you help me?



 --
 View this message in context:
 http://www.nabble.com/Facet-query-pb-tp25790667p25790667.html
 Sent from the Solr - User mailing list archive at Nabble.com.


  

 
 

-- 
View this message in context: 
http://www.nabble.com/Facet-query-pb-tp25790667p25792177.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segments prior to optimize, we'd
 always need 3x even in the normal case.

 This seems wrong, since it is repeatedly stated that in the normal case only 2x
 is needed, and I have successfully optimized a similarly sized 192G index on
 identical hardware with a 400G capacity.

2x for the IndexWriter only.
Having an open index reader can increase that somewhat... 3x is the
absolute worst case I think, and that can currently be avoided by first
calling commit and then calling optimize.  This way the open
reader will only be holding references to segments that wouldn't be
deleted until the optimize is complete anyway.
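
That sequence, sketched with SolrJ (hypothetical URL; Solr 1.4 API):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CommitThenOptimize {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr"); // hypothetical
    // Commit first: the reopened searcher now pins only current segments...
    server.commit();
    // ...so the optimize should stay near the 2x bound rather than 3x.
    server.optimize();
  }
}

For scale: with a 192GB index, 2x is already 384GB, so a 400GB volume leaves
almost no headroom for a lingering reader or filesystem fragmentation.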


-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Michael McCandless
On Wed, Oct 7, 2009 at 1:34 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 It would be good to be able to commit without opening a new
 reader however with Lucene 2.9 the segment readers for all
 available segments are already created and available via
 getReader which manages the reference counting internally.

 Using reopen redundantly creates SRs that are already held
 internally in IW.


 Jason, I think this is something we should consider changing. A user who is
 not using NRT features should not pay the price of keeping readers opened.
 We are also interested in opening a searcher just-in-time for SOLR-1293. We
 have use-cases where a SolrCore is loaded only for indexing and then
 unloaded.

This is already true today.

If you don't use NRT then the readers are not held open by Lucene.

Mike


Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
Wow, this is weird.  I commit before I optimize.  In fact, I bounce 
tomcat before I optimize just in case. It makes sense, as you say, that 
the open reader can then only be holding references to segments that 
wouldn't be deleted until the optimize is complete anyway.


But we're still exceeding 2x. And after the optimize fails, if we then 
do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.


Yonik Seeley wrote:

On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:

So this implies that for a normal optimize, in every case, due to the
Searcher holding open the existing segments prior to optimize, we'd
always need 3x even in the normal case.

This seems wrong, since it is repeatedly stated that in the normal case only 2x
is needed, and I have successfully optimized a similarly sized 192G index on
identical hardware with a 400G capacity.


2x for the IndexWriter only.
Having an open index reader can increase that somewhat... 3x is the
absolute worst case I think and that can currently be avoided by first
calling commit and then calling optimize I think.  This way the open
reader will only be holding references to segments that wouldn't be
deleted until the optimize is complete anyway.


-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Lance Norskog
Oops, sent before I finished.  Partial optimize, aka maxSegments, is a
recent Solr 1.4/Lucene 2.9 feature.

As to 2x vs. 3x, the general wisdom is that an optimize on a simple
index takes at most 2x disk space, and on a compound index takes at
most 3x. Simple is the default (*). At Divvio we had the same
problem and it never took up more than 2x.

If your index disks are really bursting at the seams, you could try
creating an empty index on a separate disk and merging your large
index into that index. The resulting index will be mostly optimized.

Lance Norskog

* in solrconfig.xml:
<useCompoundFile>false</useCompoundFile>

On 10/7/09, Phillip Farber pfar...@umich.edu wrote:
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce
 tomcat before I optimize just in case. It makes sense, as you say, that
 then the open reader can only be holding references to segments that
 wouldn't be deleted until the optimize is complete anyway.

 But we're still exceeding 2x. And after the optimize fails, if we then
 do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.

 Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong, since it is repeatedly stated that in the normal case only
 2x
 is needed, and I have successfully optimized a similarly sized 192G index on
 identical hardware with a 400G capacity.

 2x for the IndexWriter only.
 Having an open index reader can increase that somewhat... 3x is the
 absolute worst case I think and that can currently be avoided by first
 calling commit and then calling optimize I think.  This way the open
 reader will only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.


 -Yonik
 http://www.lucidimagination.com



-- 
Lance Norskog
goks...@gmail.com


Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 3:16 PM, Phillip Farber pfar...@umich.edu wrote:
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce tomcat
 before I optimize just in case. It makes sense, as you say, that then the
 open reader can only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.

 But we're still exceeding 2x.

How much over 2x?
It is possible (though relatively rare) for an optimized index to be
larger than a non-optimized index.

-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
I can't tell why calling a commit or restarting is going to help
anything - or why you need more than 2x in any case. The only reason I
can see this being the case is if you have turned on auto-commit. Otherwise the
Reader is *always* only referencing what would have to be around anyway.

You're likely just too close to the edge. There are fragmentation
issues and whatnot when you're dealing with such large files and so little
space above what you need.

Phillip Farber wrote:
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce
 tomcat before I optimize just in case. It makes sense, as you say,
 that then the open reader can only be holding references to segments
 that wouldn't be deleted until the optimize is complete anyway.

 But we're still exceeding 2x. And after the optimize fails, if we then
 do a commit or bounce tomcat, a bunch of segments disappear. I am
 stumped.

 Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu
 wrote:
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong, since it is repeatedly stated that in the normal case
 only 2x
 is needed, and I have successfully optimized a similarly sized 192G
 index on
 identical hardware with a 400G capacity.

 2x for the IndexWriter only.
 Having an open index reader can increase that somewhat... 3x is the
 absolute worst case I think and that can currently be avoided by first
 calling commit and then calling optimize I think.  This way the open
 reader will only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.


 -Yonik
 http://www.lucidimagination.com


-- 
- Mark

http://www.lucidimagination.com





Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
Okay - I think I've got you - you're talking about the case of adding a
bunch of docs, not calling commit, and then trying to optimize. I keep
coming at it from a cold optimize. Making sense to me now.

Mark Miller wrote:
 I can't tell why calling a commit or restarting is going to help
 anything - or why you need more than 2x in any case. The only reason I
 can see this being the case is if you have turned on auto-commit. Otherwise the
 Reader is *always* only referencing what would have to be around anyway.

 You're likely just too close to the edge. There are fragmentation
 issues and whatnot when you're dealing with such large files and so little
 space above what you need.

 Phillip Farber wrote:
   
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce
  tomcat before I optimize just in case. It makes sense, as you say,
 that then the open reader can only be holding references to segments
 that wouldn't be deleted until the optimize is complete anyway.

 But we're still exceeding 2x. And after the optimize fails, if we then
 do a commit or bounce tomcat, a bunch of segments disappear. I am
 stumped.

 Yonik Seeley wrote:
 
 On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu
 wrote:
   
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong, since it is repeatedly stated that in the normal case
 only 2x
 is needed, and I have successfully optimized a similarly sized 192G
 index on
 identical hardware with a 400G capacity.
 
 2x for the IndexWriter only.
 Having an open index reader can increase that somewhat... 3x is the
 absolute worst case I think and that can currently be avoided by first
 calling commit and then calling optimize I think.  This way the open
 reader will only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.


 -Yonik
 http://www.lucidimagination.com
   


   


-- 
- Mark

http://www.lucidimagination.com





Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller markrmil...@gmail.com wrote:
 I can't tell why calling a commit or restarting is going to help
 anything

Depends on what scenarios you consider, and what you are taking 2x of.

1) Open reader on index
2) Open writer and add two documents... the first causes a large
merge, and the second is just to make it a non-optimized index.
   At this point you're already at 2x of your original index size.
3) call optimize()... this will make a 3rd copy before deleting the 2nd.

-Yonik
http://www.lucidimagination.com


Solr Demo at SF New Tech Meetup

2009-10-07 Thread Nasseam Elkarra

Hello all,

For those of you in the Bay Area, we will be demoing our Bodukai  
Boutique product at the SF New Tech Meetup on Wednesday, Oct. 14:

http://sfnewtech.com/2009/10/05/1014-sf-new-tech-bodukai-yourversion-meehive-and-more/

Bodukai Boutique is the fastest ecommerce search and navigation  
solution:

http://bodukai.com/boutique/

We will be demoing our Solr integration and all are welcome to come.

Thank you,

Nasseam Elkarra
http://bodukai.com/boutique/
The fastest possible shopping experience



Re: manage rights

2009-10-07 Thread Lance Norskog
There are no security features in Solr 1.4. You cannot do this.

It would be really simple to implement a hack where all management
must be done via POST, and then allow the configuration to ban POST
requests.

On 10/7/09, clico cl...@mairie-marseille.fr wrote:

 Hi everybody
 As I'm ready to deploy my Solr server (after many tests and use cases),
 I'd like to configure my server so that certain requests cannot be posted.

 As an example :
 My CMS or data app can use
 - dataimport
 - and other indexing  commands

 My website can only perform a search on the server

 could one explain me where this configuration has to be done?

 Thanks
 --
 View this message in context:
 http://www.nabble.com/manage-rights-tp25784152p25784152.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: solr reporting tool adapter

2009-10-07 Thread Lance Norskog
The BIRT project can do what you want. It has a nice form creator and
you can configure HTTP/XML input formats.

It includes very complete Eclipse plugins and there is a book about it.


On 10/7/09, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
 On Wed, Oct 7, 2009 at 2:51 PM, Rakhi Khatwani rkhatw...@gmail.com wrote:

 we basically wanna generate PDF reports which contain, tag clouds, bar
 charts, pie charts etc.


 Faceting on a field will give you top terms and frequency information which
 can be used to create tag clouds. What do you want to plot on a bar chart?

 I don't know of a reporting tool which can hook into Solr for creating such
 things.

 --
 Regards,
 Shalin Shekhar Mangar.



-- 
Lance Norskog
goks...@gmail.com


Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller markrmil...@gmail.com wrote:
   
 I can't tell why calling a commit or restarting is going to help
 anything
 

 Depends on what scenarios you consider, and what you are taking 2x of.

 1) Open reader on index
 2) Open writer and add two documents... the first causes a large
 merge, and the second is just to make it a non-optimized index.
   At this point you're already at 2x of your original index size.
 3) call optimize()... this will make a 3rd copy before deleting the 2nd.

 -Yonik
 http://www.lucidimagination.com
   
Yup - it finally hit me what you were talking about. I wasn't considering the
case of adding docs to an existing index, not committing, and then
trying to optimize.

I like trying to take an opposing side from you anyway - it means I know
where I will end up - but you're usually so darn terse, I never know how
long till I end up there.

Anyway, so all you generally *need* is 2x; you just have to make sure
you're not adding docs first without committing them - which I was taking
for granted. That means your comment about calling commit makes perfect sense.

I guess you can't guarantee 2x though: if you have queries coming in
that take a while, a commit opening a new Reader will not guarantee the
old Reader is quite ready to go away. Might want to wait a short bit
after the commit.

-- 
- Mark

http://www.lucidimagination.com





Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-07 Thread michael8

2 things I noticed that are different from 1.3 to 1.4 for DataImport:

1. There are now 2 datetime values (per my specific schema, I'm sure) in the
dataimport.properties vs. only 1 in 1.3 (using the exact same schema).  One
is 'last_index_time', same as 1.3, and a *new* one (in 1.4) named
item.last_index_time, where 'item' is my main and only entity name specified
in my data-import.xml.  They both have the same value.

2. In 1.3, the datetime passed to SQL used to be, e.g., '2009-10-05
14:08:01', but with 1.4 the format becomes 'Mon Oct 05 14:08:01 PDT 2009',
with the day of week, name of month, and timezone spelled out.  I had an issue
with the 1.4 format with MySQL only for the timezone part, but now I have a
different solution that avoids using this last index date altogether.

I'm curious though if there's any config setting to pass to
DataImportHandler to specify the desired date/time format to use.

Michael



Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
 
 really?
 I don't remember that being changed.
 
 what difference do you notice?
 
 On Wed, Oct 7, 2009 at 2:30 AM, michael8 mich...@saracatech.com wrote:

 Just looking for confirmation from others, but it appears that the
 formatting
 of last_index_time from dataimport.properties (using DataImportHandler)
 is
 different in 1.4 vs. that in 1.3.  I was troubleshooting why delta
 imports
 are no longer working for me after moving over to solr 1.4 (10/2 nightly)
 and
 noticed that format is different.

 Michael
 --
 View this message in context:
 http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 
 -- 
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com
 
 

-- 
View this message in context: 
http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25793468.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Help with denormalizing issues

2009-10-07 Thread Eric Reeves
Hi again, I'm gonna try this again with more focus this time :D

1) Ideally what we would like to do, is plug in an additional mechanism to 
filter the initial result set, because we can't find a way to implement our 
filtering needs as filter queries against a single index.  We would want to do 
this while maintaining support for paging.  Looking through the codebase it 
looks as if this would not be possible without major surgery, due to the paging 
support being implemented deep inside private methods of SolrIndexSearcher.  
Does this sound accurate? 

2) If we pursue the other option of indexing skus and collapsing the results 
based on product id using the field collapsing patch, is there any validity to 
my concerns about indexing the same content multiple times skewing the scoring?

3) Does anyone have experience using the field collapsing patch, and have any 
idea how much additional overhead it incurs?

Thanks,
Eric

-Original Message-
From: Eric Reeves 
Sent: Monday, October 05, 2009 6:19 PM
To: solr-user@lucene.apache.org
Subject: Help with denormalizing issues

Hi there,

I'm evaluating Solr as a replacement for our current search server, and am 
trying to determine what the best strategy would be to implement our business 
needs.  Our problem is that we have a catalog schema with products and skus, 
one to many.  The most relevant content being indexed is at the product level, 
in the name and description fields.  However we are interested in filtering by 
sku attributes, and in particular making multiple filters apply to a single 
sku.  For example, find a product that contains a sku that is both blue and on 
sale.  No approach I've tried at collapsing the sku data into the product 
document works for this.  If we put the data in separate fields, there's no way 
to apply multiple filters to the same sku. And if we concatenate all of the 
relevant sku data into a single multivalued field then as I understand it, this 
is just indexed as one large field with extra whitespace between the individual 
entries, so there's still no way to enforce that an AND filter query applies to 
the same sku.

One approach I was considering was to create separate indexes for products and 
skus, and store the product IDs in the sku documents.  Then we could apply our 
own filters to the initially generated list, based on unique query parameters.  
I thought creating a component between query and facet would be a good place to 
add such a filter, but further research seems to indicate that this would break 
paging and sorting.  The only other thing I can think of would be to subclass 
QueryComponent itself, which looks rather daunting-the process() method has no 
hooks for this sort of thing, it seems I would have to copy the entire existing 
implementation and add them myself, which looks to be a fair chunk of work and 
brittle to changes in the trunk code.  Ideally it would be nice to be able to 
handle certain fq parameters in a completely different way, perhaps using a 
custom query parser, but I haven't wrapped my head around how those work.  Does 
any of this sound remotely doable?  Any advice?

The other suggestion we are looking at was given to us by our current search 
provider, which is to index the skus themselves.  It looks as if we may be able 
to make this work using the field collapsing patch from SOLR-236.  I have some 
concerns about this approach though: 1) It will make for a much larger index 
and longer indexing times (products can have 10 or more skus in our catalog).  
2) Because the indexing will be copying the description and name from the 
product it will be indexing the same content more than once, and the number of 
times per product will vary based on the number of skus.  I'm concerned that 
this may skew the scoring algorithm, in particular the inverse frequency part.  
3) I'm not sure about the performance of the field collapsing patch, I've read 
contradictory reports on the web.

I apologize if this is a bit rambling.  If anyone has any advice for our 
situation it would be very helpful.

Thanks,
Eric


Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote:
 I guess you can't guarantee 2x though, as if you have queries coming in
 that take a while, a commit opening a new Reader will not guarantee the
 old Reader is quite ready to go away. Might want to wait a short bit
 after the commit.

Right - and in a complete system, there are other things that can also
hold commit points open longer, like index replication.

-Yonik
http://www.lucidimagination.com


Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-07 Thread Shalin Shekhar Mangar
On Thu, Oct 8, 2009 at 1:38 AM, michael8 mich...@saracatech.com wrote:


 2 things I noticed that are different from 1.3 to 1.4 for DataImport:

 1. there are now 2 datetime values (per my specific schema I'm sure) in the
 dataimport.properties vs. only 1 in 1.3 (using the exact same schema).  One
 is 'last_index_time' same as 1.3, and a *new* one (in 1.4) named
 item.last_index_time, where 'item' is my main and only entity name
 specified
 in my data-import.xml.  they both have the same value.


This was added with SOLR-783 to enable delta imports of entities
individually. One can specify the entity name(s) which should be imported.
Without this it was not possible to correctly figure out deltas on a
per-entity basis.


 2. in 1.3, the datetime passed to SQL used to be, e.g., '2009-10-05
 14:08:01', but with 1.4 the format becomes 'Mon Oct 05 14:08:01 PDT 2009',
 with the day of week, name of month, and timezone spelled out.  I had issue
 with the 1.4 format with MySQL only for the timezone part, but now I have a
 different solution without using this last index date altogether.


I just committed SOLR-1496 so the different date format issue is fixed in
trunk.


 I'm curious though if there's any config setting to pass to
 DataImportHandler to specify the desired date/time format to use.


There is no configuration to change this. However, you can write your own
Evaluator to output ${dih.last_index_time} in whatever format you prefer.
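
A sketch of such an evaluator (hedged: the class name, function name, and
resolver key are assumptions; only the evaluate(String, Context) signature
comes from the 1.4 DIH API):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Evaluator;

// Hypothetical evaluator emitting a MySQL-friendly timestamp. It would be
// registered in data-config.xml as
//   <function name="mysqlDate" class="com.example.MySqlDateEvaluator"/>
// and used as ${dataimporter.functions.mysqlDate()}.
public class MySqlDateEvaluator extends Evaluator {
  public String evaluate(String expression, Context context) {
    // Resolver key is an assumption: 'dataimporter.last_index_time' holds
    // the timestamp DIH recorded for the previous run.
    Object val = context.getVariableResolver().resolve("dataimporter.last_index_time");
    Date d = (val instanceof Date) ? (Date) val : new Date();
    return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(d);
  }
}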

-- 
Regards,
Shalin Shekhar Mangar.


Re: manage rights

2009-10-07 Thread Grant Ingersoll
You should also separate your indexer from your searcher and make the
searcher's request handlers allow search only (remove the handlers you
don't need).  You could also lock down the request parameters that
they take by using invariants, etc.
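
For example, a search-only handler might pin parameters with invariants (a
sketch; the handler name and field list are illustrative):

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- clients cannot override these, whatever they send -->
    <str name="fl">id,title,score</str>
    <int name="rows">20</int>
  </lst>
</requestHandler>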


Have a look in your solrconfig.xml.  You could, of course, also have a  
ServletFilter in front of Solr or some other type of firewall that  
just throws away the requests you don't wish to support.


And, of course, firewalls can be used, too.

On Oct 7, 2009, at 4:50 PM, Lance Norskog wrote:


There are no security features in Solr 1.4. You cannot do this.

It would be really simple to implement a hack where all management
must be done via POST, and then allow the configuration to ban POST
requests.

On 10/7/09, clico cl...@mairie-marseille.fr wrote:


Hi everybody
As I'm ready to deploy my Solr server (after many tests and use cases),
I'd like to configure my server so that certain requests cannot be posted.


As an example :
My CMS or data app can use
- dataimport
- and other indexing  commands

My website can only perform a search on the server

could one explain me where this configuration has to be done?

Thanks
--
View this message in context:
http://www.nabble.com/manage-rights-tp25784152p25784152.html
Sent from the Solr - User mailing list archive at Nabble.com.





--
Lance Norskog
goks...@gmail.com


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: Help with denormalizing issues

2009-10-07 Thread Lance Norskog
The separate sku values do not become one long text string. They are separate
values in the same field. The relevance calculation is completely
separate per value.

The performance problem with the field collapsing patch is that it
does the same thing as a facet or sorting operation: it does a sweep
through the index and builds a data structure whose size depends on
the index. Faceting is not cached directly but still works very
quickly the second time. Sorting has its own cache and is very slow (N
log N) the first time and very fast afterwards. The field collapsing
patch does not cache any of its work and is almost as slow the second
time as the first time.

On 10/7/09, Eric Reeves eree...@eline.com wrote:
 Hi again, I'm gonna try this again with more focus this time :D

 1) Ideally what we would like to do, is plug in an additional mechanism to
 filter the initial result set, because we can't find a way to implement our
 filtering needs as filter queries against a single index.  We would want to
 do this while maintaining support for paging.  Looking through the codebase
 it looks as if this would not be possible without major surgery, due to the
 paging support being implemented deep inside private methods of
 SolrIndexSearcher.  Does this sound accurate?

 2) If we pursue the other option of indexing skus and collapsing the results
 based on product id using the field collapsing patch, is there any validity
 to my concerns about indexing the same content multiple times skewing the
 scoring?

 3) Does anyone have experience using the field collapsing patch, and have
 any idea how much additional overhead it incurs?

 Thanks,
 Eric

 -Original Message-
 From: Eric Reeves
 Sent: Monday, October 05, 2009 6:19 PM
 To: solr-user@lucene.apache.org
 Subject: Help with denormalizing issues

 Hi there,

 I'm evaluating Solr as a replacement for our current search server, and am
 trying to determine what the best strategy would be to implement our
 business needs.  Our problem is that we have a catalog schema with products
 and skus, one to many.  The most relevant content being indexed is at the
 product level, in the name and description fields.  However we are
 interested in filtering by sku attributes, and in particular making multiple
 filters apply to a single sku.  For example, find a product that contains a
 sku that is both blue and on sale.  No approach I've tried at collapsing the
 sku data into the product document works for this.  If we put the data in
 separate fields, there's no way to apply multiple filters to the same sku,
 and if we concatenate all of the relevant sku data into a single multivalued
 field then as I understand it, this is just indexed as one large field with
 extra whitespace between the individual entries, so there's still no way to
 enforce that an AND filter query applies to the same sku.
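
 If each sku were indexed as its own document, the combined constraint would
 become two ordinary filter queries against one document, e.g. (field names
 made up):

 /select?q=description:jacket&fq=color:blue&fq=on_sale:true&fl=product_id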

 One approach I was considering was to create separate indexes for products
 and skus, and store the product IDs in the sku documents.  Then we could
 apply our own filters to the initially generated list, based on unique query
 parameters.  I thought creating a component between query and facet would be
 a good place to add such a filter, but further research seems to indicate
 that this would break paging and sorting.  The only other thing I can think
 of would be to subclass QueryComponent itself, which looks rather
 daunting; the process() method has no hooks for this sort of thing, it seems
 I would have to copy the entire existing implementation and add them myself,
 which looks to be a fair chunk of work and brittle to changes in the trunk
 code.  Ideally it would be nice to be able to handle certain fq parameters
 in a completely different way, perhaps using a custom query parser, but I
 haven't wrapped my head around how those work.  Does any of this sound
 remotely doable?  Any advice?

 The other suggestion we are looking at was given to us by our current search
 provider, which is to index the skus themselves.  It looks as if we may be
 able to make this work using the field collapsing patch from SOLR-236.  I
 have some concerns about this approach though: 1) It will make for a much
 larger index and longer indexing times (products can have 10 or more skus in
 our catalog).  2) Because the indexing will be copying the description and
 name from the product it will be indexing the same content more than once,
 and the number of times per product will vary based on the number of skus.
 I'm concerned that this may skew the scoring algorithm, in particular the
 inverse document frequency part.  3) I'm not sure about the performance of the field
 collapsing patch, I've read contradictory reports on the web.

 I apologize if this is a bit rambling.  If anyone has any advice for our
 situation it would be very helpful.

 Thanks,
 Eric



-- 
Lance Norskog
goks...@gmail.com


Problems with WordDelimiterFilterFactory

2009-10-07 Thread Bernadette Houghton
We are having some issues with our solr parent application not retrieving 
records as expected.

For example, if the input query includes a colon (e.g. "hot and cold: 
temperatures"), the relevant record (which contains a colon in the same place) 
does not get retrieved; if the input query does not include the colon, all is 
fine.  Ditto if the user searches for a query containing hyphens, e.g. "asia - 
civilization", although with the qualifier that something like 
"asia-civilization" (no spaces either side of the hyphen) works fine, whereas 
"asia - civilization" (spaces either side of the hyphen) doesn't work.

Our schema.xml contains the following -

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
MSN: bern_hough...@hotmail.com
Email: bernadette.hough...@deakin.edu.au
Website: http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B (Vic)

Important Notice: The contents of this email are intended solely for the named 
addressee and are confidential; any unauthorised use, reproduction or storage 
of the contents is expressly prohibited. If you have received this email in 
error, please delete it and any attachments immediately and advise the sender 
by return email or telephone.
Deakin University does not warrant that this email and any attachments are 
error or virus free



Re: Problems with WordDelimiterFilterFactory

2009-10-07 Thread Christian Zambrano
Could you please provide the exact URL of a query where you are 
experiencing this problem?

e.g. (not URL encoded): q=fieldName:"hot and cold: temperatures"



Re: Indexing and searching of sharded/ partitioned databases and tables

2009-10-07 Thread Jayant Kumar Gandhi
Thanks guys. Now I can easily search thru 10TB of my personal photos,
videos, music and other stuff :)

At some point I had split them into multiple dbs and tables, and inserts
to a single db/table were taking too much time once the index grew
beyond 1gig. I was storing all the possible metadata about the media.
I used two hex characters for naming tables/dbs and ended up with 256
dbs, each with 256 tables :D . Don't ask me why I had done it this way.
Let's just say I was exploring sharding some years ago and got too
excited and did that :D. Alas, never touched it again to finish the
search portion till now when I really wanted to find a particular
photo :)

The pk is unique across all the tables so no issues there. I think I
should be able to run it off a single server at my home.

Thanks and Best Regards,
Jayant

On Wed, Oct 7, 2009 at 4:52 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore 
 sandeep.tag...@gmail.com wrote:


 You can write an automated program which will change the DB conf details in
 that xml and fire the full import command. You can use
 http://localhost:8983/solr/dataimport url to check the status of the data
 import.


 Also note that full-import deletes all existing documents. So if you write
 such a program which changes DB conf details, make sure you invoke the
 import command (new in Solr 1.4) to avoid deleting the other documents.

 --
 Regards,
 Shalin Shekhar Mangar.




-- 
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi


RE: Problems with WordDelimiterFilterFactory

2009-10-07 Thread Bernadette Houghton
Hi Christian, try this one - http://www.deakin.edu.au/dro/view/DU:3601

Either scroll down and click one of the television broadcasting -- asia 
links, or type it in the Quick Search box.


TIA

bern






Snapshot is not created when I added spellchecker with buildOnCommit

2009-10-07 Thread marklo

i've enabled the snapshooter to run after commit and it's working fine until
i've added a spellchecker with 
buildOnCommit = true...  Any idea why?   Thanks

  <updateHandler class="solr.DirectUpdateHandler2">
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">solr/bin/snapshooter</str>
      <str name="dir">.</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>

    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>
  </updateHandler>
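
For reference, the spellchecker side is declared along these lines (a sketch
only; the layout is the standard SpellCheckComponent one, but the field name
is an assumption):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <!-- rebuilds the spelling index inside every commit -->
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>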

-- 
View this message in context: 
http://www.nabble.com/Snapshot-is-not-created-when-I-added-spellchecker-with-buildOnCommit-tp25796857p25796857.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: ISOLatin1AccentFilter before or after Snowball?

2009-10-07 Thread Jay Hill
Correct me if I'm wrong, but wasn't the ISOLatin1AccentFilterFactory
deprecated in favor of:
<charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-ISOLatin1Accent.txt"/>

in 1.4?

-Jay
http://www.lucidimagination.com


On Wed, Oct 7, 2009 at 1:44 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Tue, Oct 6, 2009 at 4:33 PM, Chantal Ackermann 
 chantal.ackerm...@btelligent.de wrote:

  Hi all,
 
  from reading through previous posts on that subject, it seems like the
  accent filter has to come before the snowball filter.
 
  I'd just like to make sure this is so. If it is the case, I'm wondering
  whether snowball filters for i.e. French process accented language
  correctly, at all, or whether they remove accents anyway... Or whether
  accents should be removed whenever making use of snowball filters.
 
 
 I'd think so but I'm not sure. Perhaps someone else can weigh in.


 
  And also: it really is meant to take UTF-8 as input, even though it is
  named ISOLatin1AccentFilter, isn't it?
 
 
 See http://markmail.org/message/hi25u5iqusfu542b

 --
 Regards,
 Shalin Shekhar Mangar.



Re: ISOLatin1AccentFilter before or after Snowball?

2009-10-07 Thread Koji Sekiguchi

No, ISOLatin1AccentFilterFactory is not deprecated.
You can use either MappingCharFilterFactory+mapping-ISOLatin1Accent.txt
or ISOLatin1AccentFilterFactory whichever you'd like.

Koji






Re: Problems with WordDelimiterFilterFactory

2009-10-07 Thread Christian Zambrano

Bern,

I am interested in the Solr query. In other words, the query that your  
system sends to Solr.


Thanks,


Christian







Re: Problems with WordDelimiterFilterFactory

2009-10-07 Thread marklo

Use http://solr-url/solr/admin/analysis.jsp to see how your data is
indexed/queried

-- 
View this message in context: 
http://www.nabble.com/Problems-with-WordDelimiterFilterFactory-tp25795589p25797377.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?

2009-10-07 Thread Mint Ekalak

Works like a charm!!

thanks Shalin 

Regards,

Mint 


Shalin Shekhar Mangar wrote:
 
 On Thu, Oct 8, 2009 at 1:38 AM, michael8 mich...@saracatech.com wrote:
 

 2 things I noticed that are different from 1.3 to 1.4 for DataImport:

 1. there are now 2 datetime values (per my specific schema I'm sure) in
 the
 dataimport.properties vs. only 1 in 1.3 (using the exact same schema). 
 One
 is 'last_index_time' same as 1.3, and a *new* one (in 1.4) named
 item.last_index_time, where 'item' is my main and only entity name
 specified
  in my data-import.xml.  They both have the same value.


 This was added with SOLR-783 to enable delta imports of entities
 individually. One can specify the entity name(s) which should be imported.
 Without this it was not possible to correctly figure out deltas on a
 per-entity basis.
 
 
 2. in 1.3, the datetime passed to SQL used to be, e.g., '2009-10-05
 14:08:01', but with 1.4 the format becomes 'Mon Oct 05 14:08:01 PDT
 2009',
  with the day of week, name of month, and timezone spelled out.  I had an
  issue with the 1.4 format with MySQL only for the timezone part, but now
  I have a different solution without using this last index date altogether.


 I just committed SOLR-1496 so the different date format issue is fixed in
 trunk.
 
 
 I'm curious though if there's any config setting to pass to
 DataImportHandler to specify the desired date/time format to use.


 There is no configuration to change this. However, you can write your own
 Evaluator to output ${dih.last_index_time} in whatever format you prefer.
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
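
A minimal sketch of such an Evaluator (the class, package, and function names
are mine, and the API shown is the Solr 1.4 DIH one, so verify against your
version before relying on it):

// Registered in data-config.xml as:
//   <function name="sqlDate" class="com.example.dih.SqlDateEvaluator"/>
// and invoked in a query as:
//   ${dataimporter.functions.sqlDate(dataimporter.last_index_time)}
package com.example.dih;

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Evaluator;

public class SqlDateEvaluator extends Evaluator {
  @Override
  public String evaluate(String expression, Context context) {
    // expression is the text between the parentheses, e.g. "dataimporter.last_index_time"
    Object value = context.getVariableResolver().resolve(expression.trim());
    if (value == null) return null;
    if (!(value instanceof Date)) return String.valueOf(value);
    // emit a SQL-friendly format instead of java.util.Date's default toString()
    return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format((Date) value);
  }
}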
 
 

-- 
View this message in context: 
http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25797806.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: TermsComponent or auto-suggest with filter

2009-10-07 Thread R. Tan
Thanks Jay. What's a good way of extracting the original text from here?
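
Going by the standard facet response format, the facet values are the indexed
terms themselves, so the strings to display come back directly in the
response, roughly like this (field and values made up):

<lst name="facet_fields">
  <lst name="tc_query">
    <int name="beach boys">12</int>
    <int name="beatles">7</int>
  </lst>
</lst>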

On Thu, Oct 8, 2009 at 1:03 AM, Jay Hill jayallenh...@gmail.com wrote:

 Something like this, building on each character typed:

 facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1

 -Jay
 http://www.lucidimagination.com


 On Tue, Oct 6, 2009 at 5:43 PM, R. Tan tanrihae...@gmail.com wrote:

  Nice. In comparison, how do you do it with faceting?
 
   Two other approaches are to use either the TermsComponent (new in Solr
   1.4) or faceting.
 
 
 
  On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill jayallenh...@gmail.com wrote:
 
   Have a look at a blog I posted on how to use EdgeNGrams to build an
   auto-suggest tool:
  
  
 
 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
  
    You could easily add filter queries to this approach. For example, the
   query used in the blog could add filter queries like this:
  
    http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none
    &rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery
  
   -Jay
   http://www.lucidimagination.com
  
  
  
  
   On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote:
  
Hello,
What's the best way to get auto-suggested terms/keywords that are
filtered by one or more fields? TermsComponent should have been the
solution but filters are not supported.
   
Thanks,
Rihaed
   
  
 



Scoring for specific field queries

2009-10-07 Thread R. Tan
Hi,
How can I get wildcard search (e.g. cha*) to score documents based on the
position of the keyword in a field? Closer (to the start) means higher
score.

For example, I have multiple documents with titles containing the word
"champion". Some of the document titles start with the word "champion" and
some are entitled "we are the champions". The ones that start with the
keyword need to rank first or score higher. Is there a way to do this? I'm
using this query for an auto-suggest term feature where the keyword doesn't
necessarily need to be the first word.

Rihaed


Re: How to determine the size of the index?

2009-10-07 Thread Sandeep Tagore

Are you referring to schema info?
You can find it at http://192.168.5.25/solr/admin/file/?file=schema.xml and
http://192.168.5.25/solr/admin/schema.jsp


Fishman, Vladimir wrote:
 
  Is this info available via admin page?
 
-- 
View this message in context: 
http://www.nabble.com/How-to-retrieve-the-index-of-a-string-within-a-field--tp25771821p25798508.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Scoring for specific field queries

2009-10-07 Thread Avlesh Singh
You would need to boost your startswith matches artificially for the
desired behavior.
I would do it this way -

   1. Create a KeywordTokenized field with n-gram filter.
   2. Create a Whitespace tokenized field with n-gram flter.
   3. Search on both the fields, boost matches for #1 over #2.

Hope this helps.
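
A minimal schema.xml sketch of #1 and #2 (the field-type names and n-gram
sizes are mine, and I've used the edge variant since only prefixes matter
here):

<fieldType name="prefix_full" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="prefix_word" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With the title copied into one field of each type, a dismax query with
qf=title_full^10 title_words makes titles that start with the typed prefix
outrank mid-title matches.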

Cheers
Avlesh




Re: How to retrieve the index of a string within a field?

2009-10-07 Thread Sandeep Tagore

Elaine,
The field type "text" contains <tokenizer
class="solr.WhitespaceTokenizerFactory"/> in its definition, so all the
sentences that are indexed / queried will be split into words. So when you
search for 'get what you', you will get sentences containing get, what, you,
get what, get you, what you, get what you. So when you try to find the
indexOf of the keyword in that sentence (from results), you may not get it
every time.

Solrj can give the results in one shot but it uses an HTTP call. You can't avoid
it. You don't need to query multiple times with Solrj. Query once, get the
results, store them in java beans, process them and display the results.
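
A minimal Solrj sketch of that flow (the server URL and the "sentence" field
name are assumptions; the API is the Solr 1.4-era SolrJ one):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class OneShotQuery {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("\"get what you\"");
    query.setRows(100); // fetch enough rows in a single round trip
    QueryResponse rsp = server.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      String sentence = (String) doc.getFieldValue("sentence"); // assumed field name
      // post-process here (e.g. indexOf) instead of issuing further HTTP queries
      int pos = sentence == null ? -1 : sentence.indexOf("get what you");
      System.out.println(pos + "\t" + sentence);
    }
  }
}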

Regards,
Sandeep


Elaine Li wrote:
 
 Sandeep, I do get results when I search for get what you, not 0 results.
 What in my schema makes this difference?
 I need to learn Solrj. I am currently using javascript as a client and
 invoke http calls to get results to display in the browser. Can Solrj
 get all the results in one shot w/o the http call? I need to do some
 postprocessing against all the results and then display the processed
 data. Submitting multiple http queries and post-processing after each
 query does not seem to be the right way.
 
-- 
View this message in context: 
http://www.nabble.com/How-to-retrieve-the-index-of-a-string-within-a-field--tp25771821p25798586.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Scoring for specific field queries

2009-10-07 Thread Sandeep Tagore

Hi Rihaed,
I guess we don't need to depend on scores all the time.
You can use a custom sort to sort the results. Take a dynamicField, fill it
with the indexOf(keyword) value, and sort the results by that field in ascending
order. Then the records which contain the keyword at an earlier position
will come first.

Regards,
Sandeep


R. Tan wrote:
 
 Hi,
 How can I get wildcard search (e.g. cha*) to score documents based on the
 position of the keyword in a field? Closer (to the start) means higher
 score.
 

-- 
View this message in context: 
http://www.nabble.com/Scoring-for-specific-field-queries-tp25798390p25798657.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Scoring for specific field queries

2009-10-07 Thread Avlesh Singh

  I guess we don't need to depend on scores all the time.
  You can use a custom sort to sort the results. Take a dynamicField, fill it
  with the indexOf(keyword) value, and sort the results by that field in ascending
  order. Then the records which contain the keyword at an earlier position
  will come first.

Warning: This is a bad idea for multiple reasons:

   1. If the word "computer" occurs multiple times in a document, what
   would you do in that case? Is this dynamic field supposed to be multivalued?
   I can't even imagine what you would do if the word "computer" occurs
   multiple times across multiple documents.
   2. Multivalued fields cannot be sorted upon.
   3. One needs to know the unique number of such keywords before
   implementing, because you'll potentially end up creating that many fields.

Cheers
Avlesh





Re: delay while adding document to solr index

2009-10-07 Thread swapna_here

thanks for your reply 
but sorry for the delay 

as you said i have removed the commit while adding a single document and set
the auto commit to:
  <maxDocs>200</maxDocs>
  <maxTime>1</maxTime>

after setting this, when i ran optimize() manually the size decreased to
350MB (10 docs) from 638MB (10 docs)

i think this happened because i ran the optimize for the first time on index
data that was configured 4 months back..

this worked great, but after one week the index size again reached 504MB
(10 docs)

i don't understand why my solr index is increasing daily
when i am adding and deleting the same number of documents daily

i run org.apache.solr.client.solrj.SolrServer.optimize() manually four times
a day

is this not the right way to run optimize? if not, what is the procedure to run
optimize?

thanks in advance :)
-- 
View this message in context: 
http://www.nabble.com/delay-while-adding-document-to-solr-index-tp25676777p25798789.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Problems with WordDelimiterFilterFactory

2009-10-07 Thread Sandeep Tagore

Hi Bern,
I indexed some records with - and : today using your configuration and I
searched with the following urls
http://localhost/solr/select?q=CONTENT:"cold : temperature"
http://localhost/solr/select?q=CONTENT:"cold: temperature"
http://localhost/solr/select?q=CONTENT:"cold :temperature"
http://localhost/solr/select?q=CONTENT:"cold temperature"
and
http://localhost/solr/select?q=CONTENT:"asia - civilization"
http://localhost/solr/select?q=CONTENT:"asia- civilization"
http://localhost/solr/select?q=CONTENT:"asia -civilization"
http://localhost/solr/select?q=CONTENT:"asia civilization"
The results don't show any difference. It worked every time and I saw
the relevant records.

Regards,
Sandeep
-- 
View this message in context: 
http://www.nabble.com/Problems-with-WordDelimiterFilterFactory-tp25795589p25798793.html
Sent from the Solr - User mailing list archive at Nabble.com.