Re: NRT or similar for Solr 3.5?

2011-12-12 Thread vikram kamath
The onclick handler does not seem to be called in Google Chrome (Ubuntu).

Also, I don't seem to receive the email with the confirmation link on
registering (I have checked my spam folder).




Regards
Vikram Kamath



2011/12/12 Nagendra Nagarajayya nnagaraja...@transaxtions.com

 Steven:

 There is an onclick handler that allows you to download the src. BTW, an
 early access Solr 3.5 with RankingAlgorithm 1.3 (NRT) release is
 available for download. So please give it a try.

 Regards,

 - Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org


 On 12/10/2011 11:18 PM, Steven Ou wrote:
  All the links on the download section link to http://solr-ra.tgels.org/#
  --
  Steven Ou | 歐偉凡
 
  *ravn.com* | Chief Technology Officer
  steve...@gmail.com | +1 909-569-9880
 
 
  2011/12/11 Nagendra Nagarajayya nnagaraja...@transaxtions.com
 
  Steven:
 
  Not sure why you had problems, #downloads (
  http://solr-ra.tgels.org/#downloads ) should point you to the downloads
  section showing the different versions available for download ? Please
  share if this is not so ( there were downloads yesterday with no
 problems )
 
  Regarding NRT, you can switch between RA and Lucene at query level or at
  config level; in the current version with RA, NRT is in effect while
  with lucene, it is not, you can get more information from here:
  http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf
 
  Solr 3.5 with RankingAlgorithm 1.3 should be available next week.
 
  Regards,
 
  - Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.org
 
  On 12/9/2011 4:49 PM, Steven Ou wrote:
  Hey Nagendra,
 
  I took a look and Solr-RA looks promising - but:
 
 - I could not figure out how to download it. It seems like all the
 download links just point to #
 - I wasn't looking for another ranking algorithm, so would it be
 possible for me to use NRT but *not* RA (i.e. just use the normal
  Lucene
 library)?
 
  --
  Steven Ou | 歐偉凡
 
  *ravn.com* | Chief Technology Officer
  steve...@gmail.com | +1 909-569-9880
 
 
  On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya 
  nnagaraja...@transaxtions.com wrote:
 
  Steven:
 
  Please take a look at Solr  with RankingAlgorithm. It offers NRT
  functionality. You can set your autoCommit to about 15 mins. You can
 get
  more information from here:
  http://solr-ra.tgels.com/wiki/**en/Near_Real_Time_Search_ver_**3.x
  http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x
 
  Regards,
 
  - Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.**org 
 http://rankingalgorithm.tgels.org
 
 
  On 12/8/2011 9:30 PM, Steven Ou wrote:
 
  Hi guys,
 
  I'm looking for NRT functionality or similar in Solr 3.5. Is that
  possible?
  From what I understand there's NRT in Solr 4, but I can't figure out
  whether or not 3.5 can do it as well?
 
  If not, is it feasible to use an autoCommit every 1000ms? We don't
  currently process *that* much data so I wonder if it's OK to just
  commit
  very often? Obviously not scalable on a large scale, but it is
 feasible
  for
  a relatively small amount of data?
 
  I recently upgraded from Solr 1.4 to 3.5. I had a hard time getting
  everything working smoothly and the process ended up taking my site
  down
  for a couple hours. I am very hesitant to upgrade to Solr 4 if it's
 not
  necessary to get some sort of NRT functionality.
 
  Can anyone help me? Thanks!
  --
  Steven Ou | 歐偉凡
 
  *ravn.com* | Chief Technology Officer
  steve...@gmail.com | +1 909-569-9880
 
 
 




Re: cache monitoring tools?

2011-12-12 Thread Dmitry Kan
Hoss, I can't see why network IO would be the issue, as the shards and the front
end SOLR resided on the same server. I say resided because I got rid of the
front end (which, according to my measurements, was taking at least as much time
for merging as it took to find the actual data in the shards) and of the extra
shards. Now I have only one shard holding all the data. Filter cache tuning
also helped to reduce the number of evictions to a minimum.
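
For reference, this is the sort of solrconfig.xml entry that was tuned; the
numbers below are only illustrative, not the exact values I ended up with:

<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="1024"/>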

Dmitry

On Fri, Dec 9, 2011 at 10:42 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : The culprit seems to be the merger (frontend) SOLR. Talking to one shard
 : directly takes substantially less time (1-2 sec).
 ...
 :facet.limit=50

 Your problem most likely has very little to do with your caches at all
 -- a facet.limit that high requires sending a very large amount of data
 over the wire, multiplied by the number of shards, multiplied by some
 constant (i think it's 2 but it might be higher) in order to over
 request facet constraint counts from each shard to aggregate them.

 the dominant factor in the slow speed you are seeing is most likely
 Network IO between the shards.



 -Hoss




-- 
Regards,

Dmitry Kan


Re: cache monitoring tools?

2011-12-12 Thread Dmitry Kan
Paul, have you checked solrmeter and zabbix?

Dmitry

On Fri, Dec 9, 2011 at 11:16 PM, Paul Libbrecht p...@hoplahup.net wrote:

 Allow me to chime in and ask a generic question about monitoring tools for
 people close to developers: are any of the tools mentioned in this thread
 actually able to show graphs of loads, e.g. cache counts or CPU load, in
 parallel to a console log or to an http request log??

 I am working on such a tool currently but I have a bad feeling of
 reinventing the wheel.

 thanks in advance

 Paul



 Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :

  Otis, Tomás: thanks for the great links!
 
  2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com
 
  Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use
 any
  tool that visualizes JMX stuff like Zabbix. See
 
 
 http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
 
  On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan dmitry@gmail.com
 wrote:
 
  The culprit seems to be the merger (frontend) SOLR. Talking to one
 shard
  directly takes substantially less time (1-2 sec).
 
  On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
  Tomás: thanks. The page you gave didn't mention cache specifically, is
  there more documentation on this specifically? I have used solrmeter
  tool,
  it draws the cache diagrams, is there a similar tool, but which would
  use
  jmx directly and present the cache usage in runtime?
 
  pravesh:
  I have increased the size of filterCache, but the search hasn't become
  any
  faster, taking almost 9 sec on avg :(
 
  name: search
  class: org.apache.solr.handler.component.SearchHandler
  version: $Revision: 1052938 $
  description: Search using components:
 
 
 
 org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
 
  stats: handlerStart : 1323255147351
  requests : 100
  errors : 3
  timeouts : 0
  totalTime : 885438
  avgTimePerRequest : 8854.38
  avgRequestsPerSecond : 0.008789442
 
  the stats (copying fieldValueCache as well here, to show term
  statistics):
 
  name: fieldValueCache
  class: org.apache.solr.search.FastLRUCache
  version: 1.0
  description: Concurrent LRU Cache(maxSize=1, initialSize=10,
  minSize=9000, acceptableSize=9500, cleanupThread=false)
  stats: lookups : 79
  hits : 77
  hitratio : 0.97
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 79
  cumulative_hits : 77
  cumulative_hitratio : 0.97
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram :
 
 
 
 {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
  name: filterCache
  class: org.apache.solr.search.FastLRUCache
  version: 1.0
  description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
  minSize=138240, acceptableSize=145920, cleanupThread=false)
  stats: lookups : 1082854
  hits : 940370
  hitratio : 0.86
  inserts : 142486
  evictions : 0
  size : 142486
  warmupTime : 0
  cumulative_lookups : 1082854
  cumulative_hits : 940370
  cumulative_hitratio : 0.86
  cumulative_inserts : 142486
  cumulative_evictions : 0
 
 
  index size: 3,25 GB
 
  Does anyone have some pointers to where to look at and optimize for
  query
  time?
 
 
  2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com
 
  Hi Dimitry, cache information is exposed via JMX, so you should be
  able
  to
  monitor that information with any JMX tool. See
  http://wiki.apache.org/solr/SolrJmx
 
  On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan dmitry@gmail.com
  wrote:
 
  Yes, we do require that much.
  Ok, thanks, I will try increasing the maxsize.
 
  On Wed, Dec 7, 2011 at 10:56 AM, pravesh suyalprav...@yahoo.com
  wrote:
 
  facet.limit=50
  your facet.limit seems too high. Do you actually require this
  much?
 
  Since there a lot of evictions from filtercache, so, increase the
  maxsize
  value to your acceptable limit.
 
  Regards
  Pravesh
 
  --
  View this message in context:
 
 
 
 
 
 http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  Regards,
 
  Dmitry Kan
 
 
 
 
 
  --
  Regards,
 
  Dmitry Kan
 
 
 
 
  --
  Regards,
 
  Dmitry Kan
 
 
 
 
 
  --
  Regards,
 
  Dmitry Kan




-- 
Regards,

Dmitry Kan


limiting the content of content field in search results

2011-12-12 Thread ayyappan
I am developing an application which indexes whole PDFs and other documents
into Solr. I have completed a working version of my application, but there are
some problems. The main one is that when I do a search, the whole indexed
document is shown. I have used SolrJ and need some help to reduce this
content.

How can I limit the content of the content field in the search results and
display just that?

I need something like this:



*Grammer1.docx*
Blazing – burring Faceted Cluster – to gather Geospatial Replication –
coping Distinguish – apart from Flawlessly – perfectly Recipe –method
Concentrated inscription 
Last Modified : 2011-12-11T14:42:27Z

*who.pdf*
Who We Are Table of contents 1 Solr Committers (in alphabetical
order)fgfgfgfg2 2 Inactive Committers (in alphabetical orde 

*version_control.pdf*
Solr Version Control System Table of contents 1 Overview.gfgfgfg 2 Web Acce 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/limiting-the-content-of-content-field-in-search-results-tp3578859p3578859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 3.4 problem with words separated by coma without space

2011-12-12 Thread elisabeth benoit
Thanks for the answer.

yes in fact when I look at debugQuery output, I notice that name and number
are never treated as single entries.

I have

(((text:name text:number)) (text:ru) (text:tain) (text:paris)))

so name and number are in the same parentheses, but not exactly treated as a
phrase, as far as I know, since a phrase would be more like text:"name
number".

could you tell me what is the difference between (text:name text:number)
and (text:"name number")?

I'll check autoGeneratePhraseQueries.
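
(For reference, that option is an attribute of the fieldType in schema.xml; a
minimal sketch, with an illustrative type name:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  ...
</fieldType>

With it set to false, a single query clause that analyzes into several tokens
is searched as separate terms instead of being turned into a phrase query.)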

Best regards,
Elisabeth




2011/12/8 Chris Hostetter hossman_luc...@fucit.org


 : If I check in the solr.admin.analyzer, I get the same analysis for the
 two
 : different requests. But it seems, if fact, that the lacking space after
 : coma prevents name and number from matching.

 query analysis is only part of the picture ... Did you look at the
 debugQuery output? ...  i believe you are seeing the effects of the
 QueryParser analyzing name, distinctly from number in one case, vs
 analyzing the entire string name,number in the second case, and treating
 the latter as a phrase query (because one input clause produces multiple
 tokens)

 there is a recently added autoGeneratePhraseQueries option that affects
 this.


 -Hoss



Re: cache monitoring tools?

2011-12-12 Thread Dmitry Kan
Justin, in terms of the overhead, have you noticed if Munin puts much of it
when used in production? In terms of the solr farm: how big is a shard's
index (given you have sharded architecture).

Dmitry

On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas
justin.carat...@gmail.comwrote:

 At my work, we use Munin and Nagios for monitoring and alerts.  Munin is
 great because writing a plugin for it is so simple, and with Solr's
 statistics handler, we can track almost any solr stat we want.  It also
 comes with included plugins for load, file system stats, processes,
 etc.

 http://munin-monitoring.org/

 Justin

 Paul Libbrecht p...@hoplahup.net writes:

  Allow me to chim in and ask a generic question about monitoring tools
  for people close to developers: are any of the tools mentioned in this
  thread actually able to show graphs of loads, e.g. cache counts or CPU
  load, in parallel to a console log or to an http request log??
 
  I am working on such a tool currently but I have a bad feeling of
 reinventing the wheel.
 
  thanks in advance
 
  Paul
 
 
 
  Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :
 
  Otis, Tomás: thanks for the great links!
 
  2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com
 
  Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use
 any
  tool that visualizes JMX stuff like Zabbix. See
 
 
 http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
 
  On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan dmitry@gmail.com
 wrote:
 
  The culprit seems to be the merger (frontend) SOLR. Talking to one
 shard
  directly takes substantially less time (1-2 sec).
 
  On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
  Tomás: thanks. The page you gave didn't mention cache specifically,
 is
  there more documentation on this specifically? I have used solrmeter
  tool,
  it draws the cache diagrams, is there a similar tool, but which would
  use
  jmx directly and present the cache usage in runtime?
 
  pravesh:
  I have increased the size of filterCache, but the search hasn't
 become
  any
  faster, taking almost 9 sec on avg :(
 
  name: search
  class: org.apache.solr.handler.component.SearchHandler
  version: $Revision: 1052938 $
  description: Search using components:
 
 
 
 org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
 
  stats: handlerStart : 1323255147351
  requests : 100
  errors : 3
  timeouts : 0
  totalTime : 885438
  avgTimePerRequest : 8854.38
  avgRequestsPerSecond : 0.008789442
 
  the stats (copying fieldValueCache as well here, to show term
  statistics):
 
  name: fieldValueCache
  class: org.apache.solr.search.FastLRUCache
  version: 1.0
  description: Concurrent LRU Cache(maxSize=1, initialSize=10,
  minSize=9000, acceptableSize=9500, cleanupThread=false)
  stats: lookups : 79
  hits : 77
  hitratio : 0.97
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 79
  cumulative_hits : 77
  cumulative_hitratio : 0.97
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram :
 
 
 
 {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
  name: filterCache
  class: org.apache.solr.search.FastLRUCache
  version: 1.0
  description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
  minSize=138240, acceptableSize=145920, cleanupThread=false)
  stats: lookups : 1082854
  hits : 940370
  hitratio : 0.86
  inserts : 142486
  evictions : 0
  size : 142486
  warmupTime : 0
  cumulative_lookups : 1082854
  cumulative_hits : 940370
  cumulative_hitratio : 0.86
  cumulative_inserts : 142486
  cumulative_evictions : 0
 
 
  index size: 3,25 GB
 
  Does anyone have some pointers to where to look at and optimize for
  query
  time?
 
 
  2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com
 
  Hi Dimitry, cache information is exposed via JMX, so you should be
  able
  to
  monitor that information with any JMX tool. See
  http://wiki.apache.org/solr/SolrJmx
 
  On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan dmitry@gmail.com
  wrote:
 
  Yes, we do require that much.
  Ok, thanks, I will try increasing the maxsize.
 
  On Wed, Dec 7, 2011 at 10:56 AM, pravesh suyalprav...@yahoo.com
  wrote:
 
  facet.limit=50
  your facet.limit seems too high. Do you actually require this
  much?
 
  Since there a lot of evictions from filtercache, so, increase the
  maxsize
  value to your acceptable limit.
 
  Regards
  Pravesh
 
  --
  View this message in context:
 
 
 
 
 
 http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
  Sent from the Solr - User mailing list archive at 

InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Max
Hi there,

when highlighting a field with this definition:

<fieldType name="name" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldType>

containing this string:

Mosfellsbær

I get the following exception, if that field is in the highlight fields:

SEVERE: org.apache.solr.common.SolrException:
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token
mosfellsbaer exceeds length of provided text sized 11
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:497)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException:
Token mosfellsbaer exceeds length of provided text sized 11
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)

I tried with solr 3.4 and 3.5, same error for both. Removing the char
filter didn't fix the problem either.

It seems like there is some weird stuff going on when folding the
string, it can be seen in the analysis view, too:

http://i.imgur.com/6B2Uh.png

The end offset remains 11 even after folding and transforming æ to
ae, which seems wrong to me.

I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500
which seems like a similiar issue.

Is there a workaround for that problem or is the field configuration wrong?


Ask about the question of solr cache

2011-12-12 Thread JiaoyanChen
When I delete or add data from my application through SolrJ, or import an
index with the command nutch solrindex, the Solr caches are not updated
unless I restart Solr.
Could anyone tell me how I can refresh the Solr caches without restarting,
using a shell command?
When I recreate the index with Nutch, the data in Solr should be updated.
I use java -jar start.jar to run Solr.
Thanks!



Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Robert Muir
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

 The end offset remains 11 even after folding and transforming æ to
 ae, which seems wrong to me.

End offsets refer to the *original text* so this is correct.

What is wrong is the EdgeNGramFilter. See how it turns that 11 into a 12?


 I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500
 which seems like a similiar issue.

 Is there a workaround for that problem or is the field configuration wrong?

For now, don't use EdgeNGrams.
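
(If the ngrams are still needed for matching, one possible interim workaround
-- an untested assumption, not something verified here -- is to highlight
against a copy of the field whose analyzer omits the EdgeNGram filter, e.g.

<field name="name_plain" type="name_without_ngrams" indexed="true" stored="true"/>
<copyField source="name" dest="name_plain"/>

and then query with hl.fl=name_plain while matching on the original field.)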

-- 
lucidimagination.com


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Robert Muir
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

 It seems like there is some weird stuff going on when folding the
 string, it can be seen in the analysis view, too:

 http://i.imgur.com/6B2Uh.png


I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642

Thanks for the screenshot, makes it easy to do a test case here.

-- 
lucidimagination.com


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Max
Robert, thank you for creating the issue in JIRA.

However, I need ngrams on that field – is there an alternative to the
EdgeNGramFilterFactory ?

Thanks!

On Mon, Dec 12, 2011 at 1:25 PM, Robert Muir rcm...@gmail.com wrote:
 On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

 It seems like there is some weird stuff going on when folding the
 string, it can be seen in the analysis view, too:

 http://i.imgur.com/6B2Uh.png


 I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642

 Thanks for the screenshot, makes it easy to do a test case here.

 --
 lucidimagination.com


Re: Setting group.ngroups=true considerable slows down queries

2011-12-12 Thread Martijn v Groningen
Hi!

As far as I know there currently isn't another way. Unfortunately the
performance degrades badly when having a lot of unique groups.
I think an issue should be opened to investigate how we can improve this...

Question: Does Solr have a decent chunk of heap space (-Xmx)? Because
grouping requires quite some heap space (also without
group.ngroups=true).

Martijn

On 9 December 2011 23:08, Michael Jakl jakl.mich...@gmail.com wrote:
 Hi!

 On Fri, Dec 9, 2011 at 17:41, Martijn v Groningen
 martijn.v.gronin...@gmail.com wrote:
 On what field type are you grouping and what version of Solr are you
 using? Grouping by string field is faster.

 The field is defined as follows:
 <field name="signature" type="string" indexed="true" stored="true" />

 Grouping itself is quite fast, only computing the number of groups
 seems to increase significantly with the number of documents (linear).

 I was hoping for a faster solution to compute the total number of
 distinct documents (or in other terms, the number of distinct values
 in the signature field). Facets came to mind, but as far as I could
 see, they don't offer a total number of facets as well.

 I'm using Solr 3.5 (upgraded from Solr 3.4 without reindexing).

 Thanks,
 Michael

 On 9 December 2011 12:46, Michael Jakl jakl.mich...@gmail.com wrote:
 Hi, I'm using the grouping feature of Solr to return a list of unique
 documents together with a count of the duplicates.

 Essentially I use Solr's signature algorithm to create the signature
 field and use grouping on it.

 To provide good numbers for paging through my result list, I'd like to
 compute the total number of documents found (= matches) and the number
 of unique documents (= ngroups). Unfortunately, enabling
 group.ngroups considerably slows down the query (from 500ms to
 23000ms for a result list of roughly 30 documents).

 Is there a faster way to compute the number of groups (or unique
 values in the signature field) in the search result? My Solr instance
 currently contains about 50 million documents and around 10% of them
 are duplicates.

 Thank you,
 Michael



 --
 Met vriendelijke groet,

 Martijn van Groningen



-- 
Met vriendelijke groet,

Martijn van Groningen


ExtractingRequestHandler and HTML

2011-12-12 Thread Michael Kelleher
I am submitting HTML documents to Solr using the ERH.  Is it possible to 
store the contents of the document (including all markup) into a field?  
Using fmap.content (I am assuming this comes from Tika) stores the 
extracted text of the document in a field, but not the markup.  I want 
the whole un-altered document.


Is this possible?

thanks

--mike


Re: performance of json vs xml?

2011-12-12 Thread Erick Erickson
How are you getting your documents into Solr? Because
if you're using SolrJ it's a moot point because a binary
format is used.
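
A minimal sketch of that with the 3.x SolrJ API (the URL and field names are
only illustrative):

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDoc {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // send updates over the wire as javabin instead of XML
    server.setRequestWriter(new BinaryRequestWriter());
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    doc.addField("title", "hello");
    server.add(doc);
    server.commit();
  }
}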

I haven't done any specific comparisons, but I'd be
surprised if JSON took longer.

And removing a whole operation from your update
chain that had to be kept fed and watered is worth
the risk of a bit of slowdown.

In other words, Try it and see G...

Best
Erick

On Sun, Dec 11, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:
 I'm thinking about modifying my index process to use json because all my
 docs are originally in json anyway . Are there any performance issues if I
 insert json docs instead of xml docs?  A colleague recommended to me to
 stay with xml because solr is highly optimized for xml.


Re: Setting group.ngroups=true considerable slows down queries

2011-12-12 Thread Michael Jakl
Hi!

On Mon, Dec 12, 2011 at 13:57, Martijn v Groningen
martijn.v.gronin...@gmail.com wrote:
 As as I know currently there isn't another way. Unfortunately the
 performance degrades badly when having a lot of unique groups.
 I think an issue should be opened to investigate how we can improve this...

 Question: Does Solr have a decent chuck of heap space (-Xmx)? Because
 grouping requires quite some heap space (also without
 group.ngroups=true).

Thanks, for answering. The Server has gotten as much memory as the
machine can afford (without swapping):
  -Xmx21g \
  -Xms4g \

Shall I open an issue as a subtask of SOLR-236 even though there is
already a performance related task (SOLR-2205)?

Cheers,
Michael


Re: Setting group.ngroups=true considerable slows down queries

2011-12-12 Thread Martijn v Groningen
I'd not make a subtask under SOLR-236 b/c it is related to a
completely different implementation which was never committed.
SOLR-2205 is related to general result grouping and I think should be closed.
I'd make a new issue for improving the performance of
group.ngroups=true when there are a lot of unique groups.

Martijn

On 12 December 2011 14:32, Michael Jakl jakl.mich...@gmail.com wrote:
 Hi!

 On Mon, Dec 12, 2011 at 13:57, Martijn v Groningen
 martijn.v.gronin...@gmail.com wrote:
 As as I know currently there isn't another way. Unfortunately the
 performance degrades badly when having a lot of unique groups.
 I think an issue should be opened to investigate how we can improve this...

 Question: Does Solr have a decent chuck of heap space (-Xmx)? Because
 grouping requires quite some heap space (also without
 group.ngroups=true).

 Thanks, for answering. The Server has gotten as much memory as the
 machine can afford (without swapping):
  -Xmx21g \
  -Xms4g \

 Shall I open an issue as a subtask of SOLR-236 even though there is
 already a performance related task (SOLR-2205)?

 Cheers,
 Michael



-- 
Met vriendelijke groet,

Martijn van Groningen


manipulate the results coming back from SOLR? (was: possible to do arithmetic on returned values?)

2011-12-12 Thread Gabriel Cooper
I'm hoping I just got lost in the shuffle due to posting on a Friday 
night. Is there a way to change a field's data via some function, e.g. 
add, subtract, product, etc.?



On 12/9/11 4:17 PM, Gabriel Cooper wrote:

Is there a way to manipulate the results coming back from SOLR?

I have a SOLR 3.5 index that contains values in cents (e.g. 100 in the
index represents $1.00) and in certain contexts (e.g. CSV export) I'd
like to divide by 100 for that field to provide a user-friendly in
dollars number. To do this I played around with Function Queries for a
while without realizing they're limited to relevancy scores, and later
found DocTransformers in 4.0 whose description sounded right but don't
exist in 3.5.

Is there anything else I haven't considered?

Thanks for any help

Gabriel Cooper.




Re: cache monitoring tools?

2011-12-12 Thread Justin Caratzas
Dmitry,

The only added stress that munin puts on each box is the 1 request per
stat per 5 minutes to our admin stats handler.  Given that we get 25
requests per second, this doesn't make much of a difference.  We don't
have a sharded index (yet) as our index is only 2-3 GB, but we do have slave 
servers with replicated
indexes that handle the queries, while our master handles
updates/commits.

Justin

Dmitry Kan dmitry@gmail.com writes:

 Justin, in terms of the overhead, have you noticed if Munin puts much of it
 when used in production? In terms of the solr farm: how big is a shard's
 index (given you have sharded architecture).

 Dmitry

 On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas
 justin.carat...@gmail.comwrote:

 At my work, we use Munin and Nagio for monitoring and alerts.  Munin is
 great because writing a plugin for it so simple, and with Solr's
 statistics handler, we can track almost any solr stat we want.  It also
 comes with included plugins for load, file system stats, processes,
 etc.

 http://munin-monitoring.org/

 Justin

 Paul Libbrecht p...@hoplahup.net writes:

  Allow me to chim in and ask a generic question about monitoring tools
  for people close to developers: are any of the tools mentioned in this
  thread actually able to show graphs of loads, e.g. cache counts or CPU
  load, in parallel to a console log or to an http request log??
 
  I am working on such a tool currently but I have a bad feeling of
 reinventing the wheel.
 
  thanks in advance
 
  Paul
 
 
 
  Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :
 
  Otis, Tomás: thanks for the great links!
 
  2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com
 
  Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use
 any
  tool that visualizes JMX stuff like Zabbix. See
 
 
 http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
 
  On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan dmitry@gmail.com
 wrote:
 
  The culprit seems to be the merger (frontend) SOLR. Talking to one
 shard
  directly takes substantially less time (1-2 sec).
 
  On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan dmitry@gmail.com
 wrote:
 
  Tomás: thanks. The page you gave didn't mention cache specifically,
 is
  there more documentation on this specifically? I have used solrmeter
  tool,
  it draws the cache diagrams, is there a similar tool, but which would
  use
  jmx directly and present the cache usage in runtime?
 
  pravesh:
  I have increased the size of filterCache, but the search hasn't
 become
  any
  faster, taking almost 9 sec on avg :(
 
  name: search
  class: org.apache.solr.handler.component.SearchHandler
  version: $Revision: 1052938 $
  description: Search using components:
 
 
 
 org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
 
  stats: handlerStart : 1323255147351
  requests : 100
  errors : 3
  timeouts : 0
  totalTime : 885438
  avgTimePerRequest : 8854.38
  avgRequestsPerSecond : 0.008789442
 
  the stats (copying fieldValueCache as well here, to show term
  statistics):
 
  name: fieldValueCache
  class: org.apache.solr.search.FastLRUCache
  version: 1.0
  description: Concurrent LRU Cache(maxSize=1, initialSize=10,
  minSize=9000, acceptableSize=9500, cleanupThread=false)
  stats: lookups : 79
  hits : 77
  hitratio : 0.97
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 79
  cumulative_hits : 77
  cumulative_hitratio : 0.97
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram :
 
 
 
 {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
  name: filterCache
  class: org.apache.solr.search.FastLRUCache
  version: 1.0
  description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
  minSize=138240, acceptableSize=145920, cleanupThread=false)
  stats: lookups : 1082854
  hits : 940370
  hitratio : 0.86
  inserts : 142486
  evictions : 0
  size : 142486
  warmupTime : 0
  cumulative_lookups : 1082854
  cumulative_hits : 940370
  cumulative_hitratio : 0.86
  cumulative_inserts : 142486
  cumulative_evictions : 0
 
 
  index size: 3,25 GB
 
  Does anyone have some pointers to where to look at and optimize for
  query
  time?
 
 
  2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com
 
  Hi Dimitry, cache information is exposed via JMX, so you should be
  able
  to
  monitor that information with any JMX tool. See
  http://wiki.apache.org/solr/SolrJmx
 
  On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan dmitry@gmail.com
  wrote:
 
  Yes, we do require that much.
  Ok, thanks, I will try increasing the maxsize.
 
  On Wed, Dec 7, 2011 at 10:56 AM, 

Re: RegexQuery performance

2011-12-12 Thread Jay Luker
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson erickerick...@gmail.com wrote:
 My off-the-top-of-my-head notion is you implement a
 Filter whose job is to emit some special tokens when
 you find strings like this that allow you to search without
 regexes. For instance, in the example you give, you could
 index something like...oh... I don't know, ###VER### as
 well as the normal text of IRAS-A-FPA-3-RDR-IMPS-V6.0.
 Now, when searching for docs with the pattern you used
 as an example, you look for ###VER### instead. I guess
 it all depends on how many regexes you need to allow.
 This wouldn't work at all if you allow users to put in arbitrary
 regexes, but if you have a small enough number of patterns
 you'll allow, something like this could work.

This is a great suggestion. I think the number of users that need this
feature, as well as the variety of regexes that would be used, is small
enough that it could definitely work. It turns it into a problem of
collecting the necessary regexes, plus the UI details.
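
For anyone else going down this road, here is a rough, untested sketch of such
a filter (the class name and version pattern are made up for illustration):

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class VersionMarkerFilter extends TokenFilter {
  // tokens that look like versioned identifiers, e.g. IRAS-A-FPA-3-RDR-IMPS-V6.0
  private static final Pattern VERSIONED = Pattern.compile(".*-V\\d+\\.\\d+$");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State pending; // saved state for a marker token still to be emitted

  public VersionMarkerFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      // emit ###VER### stacked at the same position as the matched token
      restoreState(pending);
      pending = null;
      termAtt.setEmpty().append("###VER###");
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (VERSIONED.matcher(termAtt).matches()) {
      pending = captureState(); // emit the marker on the next call
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}

Queries can then look for ###VER### instead of running a regex over every term.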

Thanks!
--jay


Re: limiting the content of content field in search results

2011-12-12 Thread Juan Grande
Hi,

It sounds like highlighting might be the solution for you. See
http://wiki.apache.org/solr/HighlightingParameters
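
A minimal example of the parameters involved (the field name is illustrative):

...&q=grammar&hl=true&hl.fl=content&hl.snippets=2&hl.fragsize=150

That returns short fragments of the content field around the matched terms in
the highlighting section of the response, so you can combine it with an fl
parameter that leaves the full content field out of the document list.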

*Juan*



On Mon, Dec 12, 2011 at 4:42 AM, ayyappan ayyaba...@gmail.com wrote:

 I am developing n application which indexes whole pdfs and other documents
 to solr. I have completed a working version of my application. But there
 are
 some problems. The main one is that when I do a search the indexed whole
 document is shown. I have used solrj and need some help to reduce this
 content.

How limiting the content of content field in search results and display
 over there .

 i need like this



 *Grammer1.docx*
 Blazing – burring Faceted Cluster – to gather Geospatial Replication
 –
 coping Distinguish – apart from Flawlessly – perfectly Recipe –method
 Concentrated inscription
 Last Modified : 2011-12-11T14:42:27Z

 *who.pdf*
 Who We Are Table of contents 1 Solr Committers (in alphabetical
 order)fgfgfgfg2 2 Inactive Committers (in alphabetical orde

 *version_control.pdf*
 Solr Version Control System Table of contents 1 Overview.gfgfgfg 2 Web Acce

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/limiting-the-content-of-content-field-in-search-results-tp3578859p3578859.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Solr Load Testing

2011-12-12 Thread Kissue Kissue
Hi,

I ran some jmeter load testing on my solr instance version 3.5.0 running on
tomcat 6.6.29 using 1000 concurrent users and the error below is thrown
after a certain number of requests. My solr configuration is basically the
default configuration at this time. Has anybody done something similar?
Should solr be able to handle 1000 concurrent users based on the default
configuration? Any ideas let me know. Thanks.

12-Dec-2011 15:56:02 org.apache.solr.common.SolrException log
SEVERE: ClientAbortException:  java.io.IOException
at
org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:319)
at
org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
at
org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:98)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:344)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:861)
at
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
at
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1584)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException
at
org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer(InternalAprOutputBuffer.java:696)
at
org.apache.coyote.http11.InternalAprOutputBuffer.flush(InternalAprOutputBuffer.java:284)
at
org.apache.coyote.http11.Http11AprProcessor.action(Http11AprProcessor.java:1016)
at org.apache.coyote.Response.action(Response.java:183)
at
org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314)
... 20 more


Re: Virtual Memory very high

2011-12-12 Thread Yury Kats
On 12/11/2011 4:57 AM, Rohit wrote:
 What are the difference in the different DirectoryFactory?

http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/store/MMapDirectory.html
http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/store/NIOFSDirectory.html
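
If you want to pick one explicitly rather than rely on the default, it is set
in solrconfig.xml; a sketch only, using MMapDirectory purely as an example:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>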


Re: MySQL data import

2011-12-12 Thread Brian Lamb
Hi all,

Any tips on this one?

Thanks,

Brian Lamb

On Sun, Dec 11, 2011 at 3:54 PM, Brian Lamb
brian.l...@journalexperts.comwrote:

 Hi all,

 I have a few questions about how the MySQL data import works. It seems it
 creates a separate connection for each entity I create. Is there any way to
 avoid this?

 By nature of my schema, I have several multivalued fields. Each one I
 populate with a separate entity. Is there a better way to do it? For
 example, could I pull in all the singular data in one sitting and then come
 back in later and populate with the multivalued items.

 An alternate approach in some cases would be to do a GROUP_CONCAT and then
 populate the multivalued column with some transformation. Is that possible?

 Lastly, is it possible to use copyField to copy three regular fields into
 one multiValued field and have all the data show up?

 Thanks,

 Brian Lamb



URLDataSource delta import

2011-12-12 Thread Brian Lamb
Hi all,

According to
http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
a
delta-import is not currently implemented for URLDataSource. I say
currently because I've noticed that such documentation is out of date in
many places. I wanted to see if this feature had been added yet or if there
were plans to do so.

Thanks,

Brian Lamb


Possible to configure the fq caching settings on the server?

2011-12-12 Thread Andrew Lundgren
Is it possible to configure solr such that the filter query cache setting is 
set to fq={!cache=false} by default?

--
Andrew Lundgren
lundg...@familysearch.org


 NOTICE: This email message is for the sole use of the intended recipient(s) 
and may contain confidential and privileged information. Any unauthorized 
review, use, disclosure or distribution is prohibited. If you are not the 
intended recipient, please contact the sender by reply email and destroy all 
copies of the original message.




Re: MySQL data import

2011-12-12 Thread Gora Mohanty
On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
 Hi all,

 I have a few questions about how the MySQL data import works. It seems it
 creates a separate connection for each entity I create. Is there any way to
 avoid this?

Not sure, but I do not think that it is possible. However, from your description
below, I think that you are unnecessarily multiplying entities.

 By nature of my schema, I have several multivalued fields. Each one I
 populate with a separate entity. Is there a better way to do it? For
 example, could I pull in all the singular data in one sitting and then come
 back in later and populate with the multivalued items.

Not quite sure as to what you mean. Would it be possible for you
to post your schema.xml, and the DIH configuration file? Preferably,
put these on pastebin.com, and send us links. Also, you should
obfuscate details like access passwords.

 An alternate approach in some cases would be to do a GROUP_CONCAT and then
 populate the multivalued column with some transformation. Is that possible?
[...]

This is how we have been handling it. A complete description would
be long, but here is the gist of it:
* A transformer will be needed. In this case, we found it easiest
  to use a Java-based transformer. Thus, your entity should include
  something like
  <entity name="myname" dataSource="mysource"
          transformer="com.mycompany.search.solr.handler.JobsNumericTransformer" ...>
  ...
  </entity>
 Here, the class name to be used for the transformer attribute follows
 the usual Java rules, and the .jar needs to be made available to Solr.
* The SELECT statement for the entity looks something like
  select group_concat( myfield SEPARATOR '@||@')...
  The separator should be something that does not occur in your
  normal data stream.
* Within the entity, define
   <field column="myfield"/>
* There are complications involved if NULL values are allowed
   for the field, in which case you would need to use COALESCE,
   maybe along with CAST
* The transformer would look up myfield, split along the separator,
   and populate the multi-valued field.

This *is* a little complicated, so I would also like to hear about
possible alternatives.
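
One lighter-weight variation that may work (I have not tried it) is to let
DIH's built-in RegexTransformer do the splitting instead of a custom Java
transformer, roughly:

<entity name="myname" dataSource="mysource" transformer="RegexTransformer"
        query="select id, group_concat(myfield SEPARATOR '@||@') as myfield
               from mytable group by id">
  <field column="myfield" splitBy="@\|\|@"/>
</entity>

splitBy takes a regex, so the pipe characters need escaping; the split values
then land directly in the multi-valued field.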

Regards,
Gora


Re: Trim and copy a solr field

2011-12-12 Thread Juan Grande
Hi Swapna,

You could try using a copyField to a field that uses
PatternReplaceFilterFactory:

<fieldType class="solr.TextField" name="path_location">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)/.*"
            replacement="$1"/>
  </analyzer>
</fieldType>

The regular expression may not be exactly what you want, but it will give
you an idea of how to do it. I'm pretty sure there must be some other ways
of doing this, but this is the first that comes to my mind.
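
Wired up in schema.xml it would look roughly like this (field names are only
an example):

<field name="path" type="string" indexed="true" stored="true"/>
<field name="location" type="path_location" indexed="true" stored="true"/>
<copyField source="path" dest="location"/>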

*Juan*



On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala swapna.vupp...@arup.comwrote:

 Hi,

 I have a Solr field that contains the absolute path of the file that is
 indexed, which will be something like
 file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.

 Am interested in indexing the location in a separate field.  I was looking
 for some way to trim the field value from the last occurrence of the char
 "/", so that I can get the location value, something like
 file:/myserver/Folder1/SubFol1/Sub-Fol2,
 and store it in a new field. Can you please suggest some way to achieve
 this?

 Thanks and Regards,
 Swapna.
 
 Electronic mail messages entering and leaving Arup  business
 systems are scanned for acceptability of content and viruses



Re: MySQL data import

2011-12-12 Thread Erick Erickson
You might want to consider just doing the whole
thing in SolrJ with a JDBC connection. When things
get complex, it's sometimes more straightforward.

Best
Erick...

P.S. Yes, it's pretty standard to have a single
field be the destination for several copyField
directives.
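
A rough sketch of the SolrJ + JDBC approach (the driver URL, SQL and field
names are all made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JdbcIndexer {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://localhost/mydb", "user", "password");
    Statement stmt = db.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT d.id, d.title, GROUP_CONCAT(k.keyword SEPARATOR '|') AS keywords"
        + " FROM docs d LEFT JOIN keywords k ON k.doc_id = d.id GROUP BY d.id");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("title", rs.getString("title"));
      String keywords = rs.getString("keywords");
      if (keywords != null) {
        for (String k : keywords.split("\\|")) {
          doc.addField("keyword", k); // multivalued field: add each value separately
        }
      }
      batch.add(doc);
      if (batch.size() == 1000) { // send in batches rather than one doc at a time
        solr.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      solr.add(batch);
    }
    solr.commit();
    rs.close();
    stmt.close();
    db.close();
  }
}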

On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty g...@mimirtech.com wrote:
 On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
 brian.l...@journalexperts.com wrote:
 Hi all,

 I have a few questions about how the MySQL data import works. It seems it
 creates a separate connection for each entity I create. Is there any way to
 avoid this?

 Not sure, but I do not think that it is possible. However, from your 
 description
 below, I think that you are unnecessarily multiplying entities.

 By nature of my schema, I have several multivalued fields. Each one I
 populate with a separate entity. Is there a better way to do it? For
 example, could I pull in all the singular data in one sitting and then come
 back in later and populate with the multivalued items.

 Not quite sure as to what you mean. Would it be possible for you
 to post your schema.xml, and the DIH configuration file? Preferably,
 put these on pastebin.com, and send us links. Also, you should
 obfuscate details like access passwords.

 An alternate approach in some cases would be to do a GROUP_CONCAT and then
 populate the multivalued column with some transformation. Is that possible?
 [...]

 This is how we have been handling it. A complete description would
 be long, but here is the gist of it:
 * A transformer will be needed. In this case, we found it easiest
  to use a Java-based transformer. Thus, your entity should include
  something like
  entity name=myname dataSource=mysource
 transformer=com.mycompany.search.solr.handler.JobsNumericTransformer...
  ...
  /entity
  Here, the class name to be used for the transformer attribute follows
  the usual Java rules, and the .jar needs to be made available to Solr.
 * The SELECT statement for the entity looks something like
  select group_concat( myfield SEPARATOR '@||@')...
  The separator should be something that does not occur in your
  normal data stream.
 * Within the entity, define
   field column=myfield/
 * There are complications involved if NULL values are allowed
   for the field, in which case you would need to use COALESCE,
   maybe along with CAST
 * The transformer would look up myfield, split along the separator,
   and populate the multi-valued field.

 This *is* a little complicated, so I would also like to hear about
 possible alternatives.

 Regards,
 Gora


Re: Solr Load Testing

2011-12-12 Thread Otis Gospodnetic
Hi,

1000 *concurrent* *queries* is a lot.  If your index is small relative to hw 
specs, sure.  If not, then tuning may be needed, including maybe Tomcat and JVM 
level tuning.  The error below is from Tomcat, not really tied to Solr...
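
For instance, on the Tomcat side the size of the request-handling thread pool
is capped by maxThreads on the Connector in server.xml; a sketch only, the
values are illustrative:

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="1000" acceptCount="200"
           connectionTimeout="20000" />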

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


- Original Message -
 From: Kissue Kissue kissue...@gmail.com
 To: solr-user@lucene.apache.org
 Cc: 
 Sent: Monday, December 12, 2011 11:43 AM
 Subject: Solr Load Testing
 
 Hi,
 
 I ran some jmeter load testing on my solr instance version 3.5.0 running on
 tomcat 6.6.29 using 1000 concurrent users and the error below is thrown
 after a certain number of requests. My solr configuration is basically the
 default configuration at this time. Has anybody done soemthing similar?
 Should solr be able to handle 1000 concurrent users based on the default
 configuration? Any ideas let me know. Thanks.
 
 12-Dec-2011 15:56:02 org.apache.solr.common.SolrException log
 SEVERE: ClientAbortException:  java.io.IOException
         at
 org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:319)
         at
 org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
         at
 org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:98)
         at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
         at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
         at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
         at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115)
         at
 org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:344)
         at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
         at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
         at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
         at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
         at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
         at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
         at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
         at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
         at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
         at
 org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:861)
         at
 org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
         at
 org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1584)
         at java.lang.Thread.run(Thread.java:619)
 Caused by: java.io.IOException
         at
 org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer(InternalAprOutputBuffer.java:696)
         at
 org.apache.coyote.http11.InternalAprOutputBuffer.flush(InternalAprOutputBuffer.java:284)
         at
 org.apache.coyote.http11.Http11AprProcessor.action(Http11AprProcessor.java:1016)
         at org.apache.coyote.Response.action(Response.java:183)
         at
 org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314)
         ... 20 more



Re: performance of json vs xml?

2011-12-12 Thread Mark Miller
On Sun, Dec 11, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:

 I'm thinking about modifying my index process to use json because all my
 docs are originally in json anyway . Are there any performance issues if I
 insert json docs instead of xml docs?  A colleague recommended to me to
 stay with xml because solr is highly optimized for xml.



I'd make a big bet the JSON parsing is faster than the xml parsing.

And you have the cost of converting your docs to XML...

If you are too worried, do some testing. I'd simply use JSON. The JSON
support should be considered first class - it just came after the XML
support.
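
For example, indexing straight from JSON with the handler that ships in the
example solrconfig (the URL and fields are illustrative):

curl 'http://localhost:8983/solr/update/json?commit=true' \
  -H 'Content-type: application/json' \
  -d '[{"id":"1","title":"hello"}]'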

-- 
- Mark

http://www.lucidimagination.com


Re: SmartChineseAnalyzer

2011-12-12 Thread Chris Hostetter

: Subject: SmartChineseAnalyzer
: References:
: CAMye=3oOSfePwDEy4Off89jBTUN=K3G0=btaaxghtxvpc_v...@mail.gmail.com
:  can4yxvdc21zehiio+kkws53d_vrqn8tqc3-0qn8kq31unq7...@mail.gmail.com
:  CAMye=3ot32a02px6yotopkkkmobexw7xpv9sxzc32xkra-u...@mail.gmail.com
: In-Reply-To:
: CAMye=3ot32a02px6yotopkkkmobexw7xpv9sxzc32xkra-u...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss


Facet on same date field multiple times

2011-12-12 Thread dbashford
I've Googled around a bit and seen this referenced a few times, but cannot
seem to get it to work.

I have a query that looks like this:

facet=true
facet.date={!key=foo}date
f.foo.facet.date.start=2010-12-12T00:00:00Z
f.foo.facet.date.end=2011-12-12T00:00:00Z
f.foo.facet.date.gap=%2B1DAY

Eventually the goal is to do different ranges on the same field.  Month by
day.  Day by hour.  Year by week.  Something to that effect.  But I thought
I'd start simple to see if I could get the syntax right and what I have
above doesn't seem to work.

I get:
message Missing required parameter: f.date.facet.date.start (or default:
facet.date.start)
description The request sent by the client was syntactically incorrect
(Missing required parameter: f.date.facet.date.start (or default:
facet.date.start)).

So it doesn't seem interested in me using the local key.  From reading here: 
http://lucene.472066.n3.nabble.com/Date-Faceting-on-Solr-3-1-td3302499.html#a3309517
it would seem i should be able to do it (see the note at the bottom).

I know one option is to copyField the date into a few other spots, and I can
use that as a last resort, but if this works and I'm just arsing something
up...
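
For the record, that fallback would look roughly like this (field names are
only an example): in schema.xml

<copyField source="date" dest="date_by_day"/>
<copyField source="date" dest="date_by_hour"/>

with date_by_day and date_by_hour declared as date fields, and then per-field
params, which do work when the field names actually differ:

facet.date=date_by_day&facet.date=date_by_hour
f.date_by_day.facet.date.start=2010-12-12T00:00:00Z
f.date_by_day.facet.date.end=2011-12-12T00:00:00Z
f.date_by_day.facet.date.gap=%2B1DAY
f.date_by_hour.facet.date.start=2011-12-11T00:00:00Z
f.date_by_hour.facet.date.end=2011-12-12T00:00:00Z
f.date_by_hour.facet.date.gap=%2B1HOUR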

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facet-on-same-date-field-multiple-times-tp3580449p3580449.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facet on same date field multiple times

2011-12-12 Thread Chris Hostetter

: Eventually the goal is to do different ranges on the same field.  Month by
: day.  Day by hour.  Year by week.  Something to that effect.  But I thought
: I'd start simple to see if I could get the syntax right and what I have
: above doesn't seem to work.
...
: So it doesn't seem interested in me using the local key.  From reading here: 
: 
http://lucene.472066.n3.nabble.com/Date-Faceting-on-Solr-3-1-td3302499.html#a3309517
: it would seem i should be able to do it (see the note at the bottom).

That was me, and i was wrong in that post ... what worked was changing the 
output key, but using that key to specify the various date (ie: range) 
based params has never worked, and i didn't realize that at the time.

The work to try and fix this is currently being tracked in this Jira 
issue, i recently spelled out what i think would be needed to finish it 
up, but i don't think anyone is actively working on it (if you want to 
jump in, patches would certainly be welcome)...

https://issues.apache.org/jira/browse/SOLR-1351

-Hoss


Re: MySQL data import

2011-12-12 Thread Brian Lamb
Thanks all. Erick, is there documentation on doing things with SolrJ and a
JDBC connection?

On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson erickerick...@gmail.comwrote:

 You might want to consider just doing the whole
 thing in SolrJ with a JDBC connection. When things
 get complex, it's sometimes more straightforward.

 Best
 Erick...

 P.S. Yes, it's pretty standard to have a single
 field be the destination for several copyField
 directives.

 On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty g...@mimirtech.com wrote:
  On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
  brian.l...@journalexperts.com wrote:
  Hi all,
 
  I have a few questions about how the MySQL data import works. It seems
 it
  creates a separate connection for each entity I create. Is there any
 way to
  avoid this?
 
  Not sure, but I do not think that it is possible. However, from your
 description
  below, I think that you are unnecessarily multiplying entities.
 
  By nature of my schema, I have several multivalued fields. Each one I
  populate with a separate entity. Is there a better way to do it? For
  example, could I pull in all the singular data in one sitting and then
 come
  back in later and populate with the multivalued items.
 
  Not quite sure as to what you mean. Would it be possible for you
  to post your schema.xml, and the DIH configuration file? Preferably,
  put these on pastebin.com, and send us links. Also, you should
  obfuscate details like access passwords.
 
  An alternate approach in some cases would be to do a GROUP_CONCAT and
 then
  populate the multivalued column with some transformation. Is that
 possible?
  [...]
 
  This is how we have been handling it. A complete description would
  be long, but here is the gist of it:
  * A transformer will be needed. In this case, we found it easiest
   to use a Java-based transformer. Thus, your entity should include
   something like
   <entity name="myname" dataSource="mysource"
     transformer="com.mycompany.search.solr.handler.JobsNumericTransformer" ...>
     ...
   </entity>
   Here, the class name to be used for the transformer attribute follows
   the usual Java rules, and the .jar needs to be made available to Solr.
  * The SELECT statement for the entity looks something like
     SELECT GROUP_CONCAT(myfield SEPARATOR '@||@') ...
   The separator should be something that does not occur in your
   normal data stream.
  * Within the entity, define
     <field column="myfield"/>
  * There are complications involved if NULL values are allowed
for the field, in which case you would need to use COALESCE,
maybe along with CAST
  * The transformer would look up myfield, split along the separator,
and populate the multi-valued field.
 
  This *is* a little complicated, so I would also like to hear about
  possible alternatives.
 
  Regards,
  Gora
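To make the transformer step above concrete, here is a minimal sketch of the
kind of class Gora describes, assuming the convention-based transformRow(Map)
form that DIH accepts (the class name, column name and separator are
hypothetical):

  package com.mycompany.search.solr.handler;

  import java.util.Arrays;
  import java.util.Map;

  // Hypothetical transformer: splits a GROUP_CONCAT'ed column on the
  // '@||@' separator so the values land in a multivalued Solr field.
  public class GroupConcatSplitTransformer {

    public Object transformRow(Map<String, Object> row) {
      Object value = row.get("myfield");
      if (value != null) {
        // Putting a List into the row map makes DIH treat the field as multivalued.
        row.put("myfield", Arrays.asList(value.toString().split("@\\|\\|@")));
      }
      return row;
    }
  }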



Re: Possible to configure the fq caching settings on the server?

2011-12-12 Thread Chris Hostetter

: Is it possible to configure solr such that the filter query cache 
: settings is set to fq={!cache=false} by default?

well, you could always disable the filterCache -- but I get the impression 
you want *most* fq filters to not be cached, but sometimes you'll 
specify some that you *do* want cached? is that it?

I don't know of any way to do that (or even any way to change Solr easily to 
make that possible) for *only* the fq params.

I was going to suggest that something like this should work as a way to 
disable caching of all queries unless you explicitly re-enable it...

  ?cache=false&q={!cache=true}foo&fq=bar&fq={!cache=true}yak

...in which case you could change up your q param so it would default to 
being cached (and move that cache=false to a default in your solrconfig 
if you desired)...

  ?cache=false&q={!cache=true v=$qq}&qq=foo&fq=bar&fq={!cache=true}yak

...but evidently that doesn't work.  Apparently cache is only consulted 
as a local param, and doesn't default up to the other request (or 
configured default) SolrParams.

I'm not sure if that was intentional or an oversight -- but if you'd like 
to open a Jira requesting that it work, someone could probably look into 
it (patches welcome!)


-Hoss
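For reference, per-filter cache control in the other direction does work
today: the local param can be put on each individual fq (the field names here
are hypothetical):

  ?q=solr&fq={!cache=false}category:books&fq=inStock:true

Only the first filter skips the filterCache; the second one is cached as usual.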


Re: sub query parsing bug???

2011-12-12 Thread Steve Fuchs
Thanks for the reply!

I do believe I have set (or have tried setting) all of those options for the 
default query and none of them seem to help. Anytime an OR appears inside the 
query, the default operator for that query becomes OR. At least that's the anecdotal 
evidence I've encountered.
Also in this case the results do match what the parser is telling me, so I'm 
not getting the results I expect.

As for the second suggestion, the actual fields searched are controlled by the 
user, so it can get more complicated. But even in the single field search I do 
believe I need to use the edismax parser. I have tried the regular query syntax 
for searching one field and find that it can't handle the more complex queries.

Something like 
ref_expertise:(nonlinear OR soliton) AND "optical lattice"

won't return any documents even though there are many that satisfy those 
requirements. Is there some other way I could be executing this query even in 
the single field case?

Thanks and Thanks in Advance for all help

Steve





On Dec 6, 2011, at 8:26 AM, Erick Erickson wrote:

 Hmmm, does this help?
 
 In Solr 1.4 and prior, you should basically set mm=0 if you want the
 equivalent of q.op=OR, and mm=100% if you want the equivalent of
 q.op=AND. In 3.x and trunk the default value of mm is dictated by the
 q.op param (q.op=AND => mm=100%; q.op=OR => mm=0%). Keep in mind the
 default operator is affected by your schema.xml <solrQueryParser
 defaultOperator="xxx"/> entry. In older versions of Solr the default
 value is 100% (all clauses must match)
 (from http://wiki.apache.org/solr/DisMaxQParserPlugin).
 
 I don't think you'll see the query parsed as you expect, but the
 results of the query
 should be what you expect. Tricky, eh?
 
 I'm assuming you've simplified the example for clarity and your qf
 will be on more than one field when you use it for real, but if not
 the actual query doesn't need edismax at all.
 
 Best
 Erick
 
 On Mon, Dec 5, 2011 at 10:52 AM, Steve Fuchs st...@aps.org wrote:
 Hello All,
 
 I have my field description listed below, but I don't think its pertinent. 
 As my issue seems to be with the query parser.
 
 I'm currently using an edismax subquery clause to help with my searching as 
 such:
 
 _query_:"{!type=edismax qf='ref_expertise'}\(nonlinear OR soliton\) AND 
 \"optical lattice\""
 
 translates correctly to
 
 +(+((ref_expertise:nonlinear) (ref_expertise:soliton)) 
 +(ref_expertise:"optical lattice"))
 
 
 but the users expect the default operator to be AND (it is in all simpler 
 searches), however nothing I can do here gets me that same result as above 
 when the search is:
 
 _query_:"{!type=edismax qf='ref_expertise'}\(nonlinear OR soliton\) 
 \"optical lattice\""
 
 this gets converted to:
 
 +(((ref_expertise:nonlinear) (ref_expertise:soliton)) 
 (ref_expertise:"optical lattice"))
 
 where the "optical lattice" is optional.
 
 These produce the same results, trying q.op and mm. Also the default search 
 term as set in the solr.config is AND.
 
 _query_:"{!type=edismax q.op=AND qf='ref_expertise'}\(nonlinear OR 
 soliton\) \"optical lattice\""
 _query_:"{!type=edismax mm=1.0 qf='ref_expertise'}\(nonlinear OR 
 soliton\) \"optical lattice\""
 
 
 
 
 Any ideas???
 
 Thanks In Advance
 
 Steven Fuchs
 
 
 
 
 
 
 <fieldType name="intl_string" class="solr.TextField">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
 </fieldType>
 
 
 
 
 
 
 
 
 



Re: NRT or similar for Solr 3.5?

2011-12-12 Thread Steven Ou
Yeah, running Chrome on OS X and it doesn't do anything.

Just switched to Firefox and it works. *But* I also don't seem to be
receiving the confirmation email.
--
Steven Ou | 歐偉凡

*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880


2011/12/12 vikram kamath kmar...@gmail.com

 The Onclick handler does not seem to be called on google chrome (Ubuntu ).

 Also , I dont seem to receive the email with the confirmation link on
 registering (I have checked my spam)




 Regards
 Vikram Kamath



 2011/12/12 Nagendra Nagarajayya nnagaraja...@transaxtions.com

  Steven:
 
  There is an onclick handler that allows you to download the src. BTW, an
  early access Solr 3.5 with RankingAlgorithm 1.3 (NRT) release is
  available for download. So please give it a try.
 
  Regards,
 
  - Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.org
 
 
  On 12/10/2011 11:18 PM, Steven Ou wrote:
   All the links on the download section link to
 http://solr-ra.tgels.org/#
   --
   Steven Ou | 歐偉凡
  
   *ravn.com* | Chief Technology Officer
   steve...@gmail.com | +1 909-569-9880
  
  
   2011/12/11 Nagendra Nagarajayya nnagaraja...@transaxtions.com
  
   Steven:
  
   Not sure why you had problems, #downloads (
   http://solr-ra.tgels.org/#downloads ) should point you to the
 downloads
   section showing the different versions available for download ? Please
   share if this is not so ( there were downloads yesterday with no
  problems )
  
   Regarding NRT, you can switch between RA and Lucene at query level or
 at
   config level; in the current version with RA, NRT is in effect while
   with lucene, it is not, you can get more information from here:
   http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf
  
   Solr 3.5 with RankingAlgorithm 1.3 should be available next week.
  
   Regards,
  
   - Nagendra Nagarajayya
   http://solr-ra.tgels.org
   http://rankingalgorithm.tgels.org
  
   On 12/9/2011 4:49 PM, Steven Ou wrote:
   Hey Nagendra,
  
   I took a look and Solr-RA looks promising - but:
  
  - I could not figure out how to download it. It seems like all the
  download links just point to #
  - I wasn't looking for another ranking algorithm, so would it be
  possible for me to use NRT but *not* RA (i.e. just use the normal
   Lucene
  library)?
  
   --
   Steven Ou | 歐偉凡
  
   *ravn.com* | Chief Technology Officer
   steve...@gmail.com | +1 909-569-9880
  
  
   On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya 
   nnagaraja...@transaxtions.com wrote:
  
   Steven:
  
   Please take a look at Solr  with RankingAlgorithm. It offers NRT
   functionality. You can set your autoCommit to about 15 mins. You can
  get
   more information from here:
   http://solr-ra.tgels.com/wiki/**en/Near_Real_Time_Search_ver_**3.x
   http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x
  
   Regards,
  
   - Nagendra Nagarajayya
   http://solr-ra.tgels.org
   http://rankingalgorithm.tgels.**org 
  http://rankingalgorithm.tgels.org
  
  
   On 12/8/2011 9:30 PM, Steven Ou wrote:
  
   Hi guys,
  
   I'm looking for NRT functionality or similar in Solr 3.5. Is that
   possible?
   From what I understand there's NRT in Solr 4, but I can't figure
 out
   whether or not 3.5 can do it as well?
  
   If not, is it feasible to use an autoCommit every 1000ms? We don't
   currently process *that* much data so I wonder if it's OK to just
   commit
   very often? Obviously not scalable on a large scale, but it is
  feasible
   for
   a relatively small amount of data?
  
   I recently upgraded from Solr 1.4 to 3.5. I had a hard time getting
   everything working smoothly and the process ended up taking my site
   down
   for a couple hours. I am very hesitant to upgrade to Solr 4 if it's
  not
   necessary to get some sort of NRT functionality.
  
   Can anyone help me? Thanks!
   --
   Steven Ou | 歐偉凡
  
   *ravn.com* | Chief Technology Officer
   steve...@gmail.com | +1 909-569-9880
  
  
  
 
 



Removing whitespace

2011-12-12 Thread Devon Baumgarten
Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The 
only answer I have found suggested that it is necessary to write my own 
tokenizer. Is this true? I want to remove whitespace and special characters 
from the phrase and create N-grams from the result.

Ultimately, the effect I am after is that searching "bobdole" would match "Bob 
Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can 
anyone lend some assistance?

Thanks!

Dev B



Re: Removing whitespace

2011-12-12 Thread Alireza Salimi
That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section, maybe it helps.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories

On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten 
dbaumgar...@nationalcorp.com wrote:

 Hello,

 I am having trouble finding how to remove/ignore whitespace when indexing.
 The only answer I have found suggested that it is necessary to write my own
 tokenizer. Is this true? I want to remove whitespace and special characters
 from the phrase and create N-grams from the result.

 Ultimately, the effect I am after is that searching bobdole would match
 Bob Dole, Bo B. Dole, and maybe Bobdo. Maybe there is a better way...
 can anyone lend some assistance?

 Thanks!

 Dev B




-- 
Alireza Salimi
Java EE Developer


RE: Removing whitespace

2011-12-12 Thread Steven A Rowe
Hi Devon,

Something like this should work for you (untested!):

<analyzer>
  <!-- Remove non-word characters; only underscores, letters & numbers allowed -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
</analyzer>

Steve

 -Original Message-
 From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com]
 Sent: Monday, December 12, 2011 4:52 PM
 To: 'solr-user@lucene.apache.org'
 Subject: Removing whitespace
 
 Hello,
 
 I am having trouble finding how to remove/ignore whitespace when indexing.
 The only answer I have found suggested that it is necessary to write my
 own tokenizer. Is this true? I want to remove whitespace and special
 characters from the phrase and create N-grams from the result.
 
 Ultimately, the effect I am after is that searching bobdole would match
 Bob Dole, Bo B. Dole, and maybe Bobdo. Maybe there is a better
 way... can anyone lend some assistance?
 
 Thanks!
 
 Dev B
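If it helps, a rough sketch of how the analyzer above might be wired into a
schema, with the original field copied into it (field and type names are
hypothetical, untested):

  <fieldType name="squashed_ngram" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>
  </fieldType>

  <field name="name_squashed" type="squashed_ngram" indexed="true" stored="false"/>
  <copyField source="name" dest="name_squashed"/>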



Re: Removing whitespace

2011-12-12 Thread Koji Sekiguchi

(11/12/13 6:51), Devon Baumgarten wrote:

Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The 
only answer I have found suggested that it is necessary to write my own 
tokenizer. Is this true? I want to remove whitespace and special characters 
from the phrase and create N-grams from the result.


How about using one of the existing CharFilters?

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/MappingCharFilterFactory.html

koji
--
Check out Query Log Visualizer for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/


RE: Removing whitespace

2011-12-12 Thread Devon Baumgarten
Thanks Alireza, Steven and Koji for the quick responses!

I'll read up on those and give it a shot.

Devon Baumgarten

-Original Message-
From: Alireza Salimi [mailto:alireza.sal...@gmail.com] 
Sent: Monday, December 12, 2011 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Removing whitespace

That sounds strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section, maybe it helps.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories




On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten 
dbaumgar...@nationalcorp.com wrote:

 Hello,

 I am having trouble finding how to remove/ignore whitespace when indexing.
 The only answer I have found suggested that it is necessary to write my own
 tokenizer. Is this true? I want to remove whitespace and special characters
 from the phrase and create N-grams from the result.

 Ultimately, the effect I am after is that searching bobdole would match
 Bob Dole, Bo B. Dole, and maybe Bobdo. Maybe there is a better way...
 can anyone lend some assistance?

 Thanks!

 Dev B




-- 
Alireza Salimi
Java EE Developer




Re: MySQL data import

2011-12-12 Thread Erick Erickson
Here's a quick demo I wrote at one point. I haven't run it in a while,
but you should be able to get the idea.


package jdbc;


import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;


public class Indexer {
  public static void main(String[] args) {
    startIndex("http://localhost:8983/solr");
  }

  private static void startIndex(String url) {
    Connection con = DataSource.getConnection();
    try {

      long start = System.currentTimeMillis();
      // Create a multi-threaded communications channel to the Solr
      // server. Full interface (3.3) at:
      // http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
      StreamingUpdateSolrServer server = new
          StreamingUpdateSolrServer(url, 10, 4);

      // You may want to set these timeouts higher, Solr occasionally
      // will have long pauses while segments merge.
      server.setSoTimeout(1000);  // socket read timeout
      server.setConnectionTimeout(100);
      //server.setDefaultMaxConnectionsPerHost(100);
      //server.setMaxTotalConnections(100);
      //server.setFollowRedirects(false);  // defaults to false
      // allowCompression defaults to false.
      // Server side must support gzip or deflate for this to have any effect.
      //server.setAllowCompression(true);
      server.setMaxRetries(1); // defaults to 0. > 1 not recommended.
      server.setParser(new XMLResponseParser()); // binary parser is
                                                 // used by default

      doDocuments(server, con);
      server.commit(); // Only needs to be done at the end, autocommit
                       // or commitWithin should do the rest.
      long endTime = System.currentTimeMillis();
      System.out.println("Total Time Taken - " + (endTime - start) + " mils");

    } catch (Exception e) {
      e.printStackTrace();
      String msg = e.getMessage();
      System.out.println(msg);
    }
  }

  private static void doDocuments(StreamingUpdateSolrServer server,
      Connection con) throws SQLException, IOException, SolrServerException {

    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("select id,title,text from test");

    // SolrInputDocument interface (3.3) at
    // http://lucene.apache.org/solr/api/org/apache/solr/common/SolrInputDocument.html
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    int total = 0;
    int counter = 0;

    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument(); // DO NOT move
          // this outside the while loop, or be sure to call doc.clear()

      String id = rs.getString("id");
      String title = rs.getString("title");
      String text = rs.getString("text");

      doc.addField("id", id);
      doc.addField("title", title);
      doc.addField("text", text);

      docs.add(doc);
      ++counter;
      ++total;
      if (counter > 1000) { // Completely arbitrary, just batch up
          // more than one document for throughput!
        server.add(docs);
        docs.clear();
        counter = 0;
      }
    }
    if (!docs.isEmpty()) { // don't forget to send the final partial batch
      server.add(docs);
      docs.clear();
    }
    System.out.println("Total " + total + " Docs added successfully");

  }
}

// Trivial class showing connecting to a MySql database server via jdbc...
class DataSource {
  public static Connection getConnection() {
    Connection conn = null;
    try {

      Class.forName("com.mysql.jdbc.Driver").newInstance();
      System.out.println("Driver Loaded..");
      conn = DriverManager.getConnection("jdbc:mysql://172.16.0.169:3306/test?"
          + "user=testuser&password=test123");
      System.out.println("Connection build..");
    } catch (Exception ex) {
      System.out.println(ex);
    }
    return conn;
  }

  public static void closeConnection(Connection con) {
    try {
      if (con != null)
        con.close();
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
}

On Mon, Dec 12, 2011 at 2:57 PM, Brian Lamb
brian.l...@journalexperts.com wrote:
 Thanks all. Erick, is there documentation on doing things with SolrJ and a
 JDBC connection?

 On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 You might want to consider just doing the whole
 thing in SolrJ with a JDBC connection. When things
 get complex, it's sometimes more straightforward.

 Best
 Erick...

 P.S. Yes, it's pretty standard to have a single
 field be the destination for several copyField
 directives.

 On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty g...@mimirtech.com wrote:
  On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
  brian.l...@journalexperts.com wrote:
  Hi all,
 
  I have a few questions about how the MySQL data import works. It seems
 it
  creates a separate connection for each entity I create. Is there any
 way to
  avoid this?
 
  Not sure, but I do not think that it is possible. However, from your
 

Re: Images for the DataImportHandler page

2011-12-12 Thread Chris Hostetter

: There is some very useful information on the 
: http://wiki.apache.org/solr/DataImportHandler page about indexing 
: database contents, but the page contains three images whose links are 
: broken. The descriptions of those images sound like it would be quite 
: handy to see them in the page. Could someone please fix the links so the 
: images are displayed?

Images, and all attachments in general, were disabled some time back for 
all of wiki.apache.org.  Pages that still refer/link to old attachments 
just never got updated after the fact to reflect this.

ASF Infra has a policy permitting individual wikis to re-enable 
attachment support, but doing so would require switching the entire wiki 
over to a new ACL model, where only people who had been granted explicit 
access to perform edits would be allowed to do so.

My personal opinion is that I'd rather have a low barrier for editing the 
wiki (ie: register and do a textcha) and live w/o images, rather than have 
images but have a high barrier to editing (ie: register, ask for 
edit permission from a committer, *and* do textchas).  But I'm open to 
other suggestions...

https://wiki.apache.org/general/OurWikiFarm
https://wiki.apache.org/general/OurWikiFarm#Attachments


-Hoss


Re: server down caused by complex query

2011-12-12 Thread Chris Hostetter

: Because our user send very long and complex queries with asterisk and near
: operator.
: Sometimes near operator exceeds 1,000 and keywords almost include asterisk.
: If such query is sent to server, jvm memory is full. (our jvm memory

A near operator isn't something I know of as a built-in feature of Solr 
(definitely not Solr 1.4) ... which query parser are you using?  

What is the value of your maxBooleanClauses setting in solrconfig.xml?  
That's the mechanism that should help to limit the risk of query explosion if 
users try to overwhelm the server with really large queries, but for 
wildcard and prefix queries (ie: using *) even Solr 1.4 implemented 
those using ConstantScoreQuery instead of using query expansion, so I'm 
not sure how/why a single query could eat up so much RAM.

In general, there have been a lot of improvements in memory usage in 
recent versions of Solr, so I suggest you upgrade to Solr 3.5 -- but 
beyond that basic advice, any other suggestions will require a *lot* more 
specifics about exactly what your configs look like, the full requests 
(all params) of the queries that are causing you problems, details on your 
JVM configuration, etc... 

-Hoss
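For reference, maxBooleanClauses is set in solrconfig.xml (in the <query>
section, as far as I recall); the example config ships with:

  <maxBooleanClauses>1024</maxBooleanClauses>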


FTP mount crash when crawling with solrj

2011-12-12 Thread hadi
I have a lot of files in my FTP account, and I use curlftpfs to mount
them to a folder and then start indexing them with the solrj api. But after a few
minutes something strange happens: the mounted folder is no longer accessible and
crashes, and I cannot unmount it; the message "device is in use" appears.
My solrj code is OK and I tested it with my local files and the result is
great, but indexing the mounted folder is my terrible problem. I mention that I
used curlftpfs with CentOS, Fedora and Ubuntu, but the result of
crashing is the same. How can I fix this problem? Is the problem with my
code? Has somebody ever faced this problem when indexing a mounted folder?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/FTP-mount-crash-when-crawling-with-solrj-tp3580982p3580982.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr-3.5.0/Nutch-1.4 - SolrDeleteDuplicates fails

2011-12-12 Thread Patrick Durusau

Greetings!

On the Nutch Tutorial:

I can run the following commands with Solr-3.5.0/Nutch-1.4:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5


then:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/*



successfully.

But, if I run:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

It fails with the following messages:

SolrIndexer: starting at 2011-12-11 14:01:27

Adding 11 documents

SolrIndexer: finished at 2011-12-11 14:01:28, elapsed: 00:00:01

SolrDeleteDuplicates: starting at 2011-12-11 14:01:28

SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/

Exception in thread main java.io.IOException: Job failed!

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)

at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)


at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)


at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I am running on Ubuntu 10.10 with 12 GB of memory, Java version 1.6.0_26.

I can delete the crawl directory and replicate this error consistently.

Suggestions?

Other than ...use the way that doesn't fail. ;-)

I am concerned that a different invocation of Solr failing consistently 
represents something that may cause trouble elsewhere when least 
expected. (And hard to isolate as the problem.)


Thanks!

Hope everyone is having a great weekend!

Patrick

PS: From the hadoop log (when it fails) if that's helpful:

2011-12-11 15:21:51,436 INFO  solr.SolrWriter - Adding 11 documents

2011-12-11 15:21:52,250 INFO  solr.SolrIndexer - SolrIndexer: finished 
at 2011-12-11 15:21:52, elapsed: 00:00:01


2011-12-11 15:21:52,251 INFO  solr.SolrDeleteDuplicates - 
SolrDeleteDuplicates: starting at 2011-12-11 15:21:52


2011-12-11 15:21:52,251 INFO  solr.SolrDeleteDuplicates - 
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/


2011-12-11 15:21:52,330 WARN  mapred.LocalJobRunner - job_local_0020

java.lang.NullPointerException

at org.apache.hadoop.io.Text.encode(Text.java:388)

at org.apache.hadoop.io.Text.set(Text.java:178)

at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)


at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)


at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)


at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)


at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)

at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)


--
Patrick Durusau
patr...@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
OASIS Technical Advisory Board (TAB) - member

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau



highlighting questions

2011-12-12 Thread Bent Jensen


I am trying to figure out how to display search query fields highlighted in 
HTML. I can enable the highlighting in the query, and I think I get the correct 
response back (see below: I search using 'Contents' and the highlighting is 
shown with <strong> and </strong>). However, I can't figure out what to add to 
the XSLT file to display it in HTML. I think it is a question of defining the 
appropriate XPath(?), but I am stuck. Can someone point me in the right 
direction? Thanks in advance!


Here is the result I get back:
<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">20</int>
    <lst name="params">
      <str name="explainOther"/>
      <str name="indent">on</str>
      <str name="hl.simple.pre">'&lt;strong&gt;'</str>
      <str name="hl.fl">*</str>
      <str name="wt"/>
      <str name="hl">on</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
      <str name="fl"/>
      <str name="start">0</str>
      <str name="q">contents</str>
      <str name="hl.simple.post">'&lt;/strong&gt;'</str>
      <str name="qt"/>
      <str name="fq"/>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="content">
        <str>Start with the Table of Contents. See if you can find the topic that you 
are interested in. Look through the section to see if there is a resource that 
can help you. If you find one, you may want to attach a Post-it tab so you can 
find the page later. Write down all of the information that you need to find 
out more information about the resource: agency name, name of contact person, 
telephone number, email and website addresses. If you were unable to find a 
resource that will help you in this resource guide, a good first step would be 
to call your local Independent Living Center. They will have a good idea of 
what is available in your area. A second step would be to call or email us at 
the Rehabilitation Research Center. We have a ROBOT resource specialist who may 
be able to assist. You can reach Lois Roberts, the “Back On Track …To Success” 
Mentoring Program Assistant, at 408-793-6426 or email her at 
lois.robe...@hhs.sccgov.org</str>
      </arr>
      <arr name="doclink">
        <str>robot.pdf#page=11</str>
      </arr>
      <str name="heading1">CHAPTER 1: How to Use This Resource Guide</str>
      <str name="id">1-1</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="1-1">
      <arr name="content">
        <str>Start with the Table of '&lt;strong&gt;'Contents'&lt;/strong&gt;'. See if you can 
find the topic that you are interested in. Look</str>
      </arr>
    </lst>
  </lst>
</response>
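A minimal XSLT sketch for pulling the highlighted snippet out of a response
like the one above, to drop into the existing stylesheet (the XPath assumes
the default XML response format; disable-output-escaping is needed if the
embedded markers should come through as real HTML rather than escaped text):

  <xsl:template match="lst[@name='highlighting']/lst">
    <p class="snippet">
      <xsl:value-of select="arr[@name='content']/str" disable-output-escaping="yes"/>
    </p>
  </xsl:template>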

Re: server down caused by complex query

2011-12-12 Thread Jason
Hello, Hoss

We're using ComplexPhraseQueryParser and our maxBooleanClauses setting is
100.
I know maxBooleanClauses is so big.
But we are an expert-search organization and queries are very complex and
include wildcards.
So we need it.
Our application receives queries like ((A* OR B* OR C*,...) n/2 (X*
OR Y* OR Z*,...)) AND (...) from the user.
They are then converted into Solr queries like ("A* X*"~2 OR "A* Y*"~2 OR "A*
Z*"~2 OR "B* X*"~2 OR ...) AND (...).
As above, the clauses for the near expression are written out repeatedly.
I expect this is inefficient and why the JVM memory fills up.

I think the surround query parser may be our solution.
So now we are customizing the surround query parser because it is very limited.


Below is our tomcat setenv...
==
export CATALINA_OPTS="-Xms112640m -Xmx112640m"
export CATALINA_OPTS="$CATALINA_OPTS -Dserver"
export CATALINA_OPTS="$CATALINA_OPTS
-Djava.library.path=/usr/local/lib:/usr/local/apr/lib"
export CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9014
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false"
export CATALINA_OPTS="$CATALINA_OPTS -Dfile.encoding=utf-8"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+UseConcMarkSweepGC"
==

Thanks
Jason

--
View this message in context: 
http://lucene.472066.n3.nabble.com/server-down-caused-by-complex-query-tp3535506p3581218.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sub query parsing bug???

2011-12-12 Thread Erick Erickson
Well, your query below becomes ref_expertise:(nonlinear OR soliton)
AND default_search:"optical lattice":

The regular Solr/Lucene query should handle pretty much anything you
can throw at it. But do be aware that Solr/Lucene syntax is not true
boolean logic, you have to think in terms of SHOULD, MUST, MUST_NOT.

But this works:
q={!type=edismax qf='name'}(nonlinear OR soliton) AND "optical lattice"
giving this:
+(+((name:nonlinear) (name:soliton)) +(name:"optical lattice"))

Best
Erick

On Mon, Dec 12, 2011 at 3:29 PM, Steve Fuchs st...@aps.org wrote:
 Thanks for the reply!

 I do believe I have set (or have tried setting) all of those options for the 
 default query and none of them seem to help. Anytime an OR appears inside the 
 query the default for that query becomes OR. At least thats the anecdotal 
 evidence I've encountered.
 Also in this case the results do match what the parser is telling me, so I'm 
 not getting the results I expect.

 As for the second suggestion, the actual fields searched are controlled by 
 the user, so it can get more complicated. But even in the single field search 
 I do believe I need to use the edismax parser. I have tried the regular query 
 syntax for searching one field and find that it can't handle the more complex 
 queries.

 Something like
 ref_expertise:(nonlinear OR soliton) AND optical lattice

 won't return any documents even though there are many that satisfy those 
 requirements. Is there some other way I could be executing this query even in 
 the single field case?

 Thanks and Thanks in Advance for all help

 Steve





 On Dec 6, 2011, at 8:26 AM, Erick Erickson wrote:

 Hmmm, does this help?

 In Solr 1.4 and prior, you should basically set mm=0 if you want the
 equivilent of q.op=OR, and mm=100% if you want the equivilent of
 q.op=AND. In 3.x and trunk the default value of mm is dictated by the
 q.op param (q.op=AND = mm=100%; q.op=OR = mm=0%). Keep in mind the
 default operator is effected by your schema.xml solrQueryParser
 defaultOperator=xxx/ entry. In older versions of Solr the default
 value is 100% (all clauses must match)
 (from http://wiki.apache.org/solr/DisMaxQParserPlugin).

 I don't think you'll see the query parsed as you expect, but the
 results of the query
 should be what you expect. Tricky, eh?

 I'm assuming you've simplified the example for clarity and your qf
 will be on more than one field when you use it for real, but if not
 the actual query doesn't need edismax at all.

 Best
 Erick

 On Mon, Dec 5, 2011 at 10:52 AM, Steve Fuchs st...@aps.org wrote:
 Hello All,

 I have my field description listed below, but I don't think its pertinent. 
 As my issue seems to be with the query parser.

 I'm currently using an edismax subquery clause to help with my searching as 
 such:

 _query_:{!type=edismax qf='ref_expertise'}\(nonlinear OR soliton\) AND 
 \optical lattice\

 translates correctly to

 +(+((ref_expertise:nonlinear) (ref_expertise:soliton)) 
 +(ref_expertise:optical lattice))


 but the users expect the default operator to be AND (it is in all simpler 
 searches), however nothing I can do here gets me that same result as above 
 when the search is:

 _query_:{!type=edismax qf='ref_expertise'}\(nonlinear OR soliton\) 
 \optical lattice\

 this gets converted to:

 +(((ref_expertise:nonlinear) (ref_expertise:soliton)) 
 (ref_expertise:optical lattice))

 where the optical lattice is optional.

 These produce the same results, trying q.op and mm. Also the default search 
 term as  set in the solr.config is AND.

 _query_:{!type=edismax q.op=AND qf='ref_expertise'}\(nonlinear OR 
 soliton\)\optical lattice\
 _query_:{!type=edismax mm=1.0 qf='ref_expertise'}\(nonlinear OR 
 soliton\)\optical lattice\




 Any ideas???

 Thanks In Advance

 Steven Fuchs






    fieldType name=intl_string class=solr.TextField 
      analyzer type=index
        tokenizer class=solr.WhitespaceTokenizerFactory/
        filter class=solr.WordDelimiterFilterFactory 
 preserveOriginal=1/
        filter class=solr.LowerCaseFilterFactory/
        filter class=solr.ASCIIFoldingFilterFactory /
        filter class=solr.EdgeNGramFilterFactory minGramSize=2 
 maxGramSize=25 /
      /analyzer
      analyzer type=query
        tokenizer class=solr.WhitespaceTokenizerFactory/
        filter class=solr.LowerCaseFilterFactory/
        filter class=solr.ASCIIFoldingFilterFactory /
      /analyzer
    /fieldType












Re: Reducing heap space consumption for large dictionaries?

2011-12-12 Thread Maciej Lisiewski

Hi,

in my index schema I have defined a
DictionaryCompoundWordTokenFilterFactory and a
HunspellStemFilterFactory. Each FilterFactory has a dictionary with
about 100k entries.

To avoid an out of memory error I have to set the heap space to 128m
for 1 index.

Is there a way to reduce the memory consumption when parsing the dictionary?
I need to create several indexes and 128m for each index is too much.


Same problem here - even with an empty index (no data yet) and two 
fields using Hunspell (pl_PL) I had to increase heap size to over 2GB 
for solr to start at all..


Stempel using the very same dictionary works fine with 128M..

--
Maciej Lisiewski


Re: Reducing heap space consumption for large dictionaries?

2011-12-12 Thread Chris Male
Hi,

It's good to hear some feedback on using the Hunspell dictionaries.
 Lucene's support is pretty new so we're obviously looking to improve it.
 Could you open a JIRA issue so we can explore whether there are some ways
to reduce memory consumption?

On Tue, Dec 13, 2011 at 5:37 PM, Maciej Lisiewski c2h...@poczta.fm wrote:

 Hi,

 in my index schema I has defined a
 DictionaryCompoundWordTokenFil**terFactory and a
 HunspellStemFilterFactory. Each FilterFactory has a dictionary with
 about 100k entries.

 To avoid an out of memory error I have to set the heap space to 128m
 for 1 index.

 Is there a way to reduce the memory consumption when parsing the
 dictionary?
 I need to create several indexes and 128m for each index is too much.


 Same problem here - even with an empty index (no data yet) and two fields
 using Hunspell (pl_PL) I had to increase heap size to over 2GB for solr to
 start at all..

 Stempel using the very same dictionary works fine with 128M..

 --
 Maciej Lisiewski




-- 
Chris Male | Software Developer | DutchWorks | www.dutchworks.nl


RE: Trim and copy a solr field

2011-12-12 Thread Swapna Vuppala
Hi Juan,

Thanks for the reply. I tried using this, but I don't see any effect of the 
analyzer/filter.

I tried copying my Solr field to another field of the type defined below. Then 
I indexed a couple of documents with the new schema, but I see that both fields 
have got the same value.
Am looking at the indexed data in Luke.

Am assuming that analyzers process the field value (as specified by various 
filters etc) and then store the modified value. Is that true ? What else could 
I be missing here ?

Thanks and Regards,
Swapna.

-Original Message-
From: Juan Grande [mailto:juan.gra...@gmail.com] 
Sent: Monday, December 12, 2011 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Trim and copy a solr field

Hi Swapna,

You could try using a copyField to a field that uses
PatternReplaceFilterFactory:

<fieldType class="solr.TextField" name="path_location">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)/.*"
            replacement="$1"/>
  </analyzer>
</fieldType>

The regular expression may not be exactly what you want, but it will give
you an idea of how to do it. I'm pretty sure there must be some other ways
of doing this, but this is the first that comes to my mind.

*Juan*



On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala swapna.vupp...@arup.comwrote:

 Hi,

 I have a Solr field that contains the absolute path of the file that is
 indexed, which will be something like
 file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.

 Am interested in indexing the location in a separate field.  I was looking
 for some way to trim the field value from the last occurrence of the char /, so
 that I can get the location value, something like
 file:/myserver/Folder1/SubFol1/Sub-Fol2,
 and store it in a new field. Can you please suggest some way to achieve
 this?

 Thanks and Regards,
 Swapna.
 
 Electronic mail messages entering and leaving Arup  business
 systems are scanned for acceptability of content and viruses



Generic RemoveDuplicatesTokenFilter

2011-12-12 Thread pravesh
Hi All,

Currently, SOLR's existing RemoveDuplicatesTokenFilter
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory)
filters out duplicate tokens with the same
text and logically at the same position.

In my case, if the same term appears repeatedly one after the other then I
need to remove all duplicates and consume only a single occurrence of the term
(even if the positionIncrementGap == 1).

For e.g. the input stream is:  "quick brown brown brown fox jumps jumps
over the little little lazy brown dog"
Then the output should be:  "quick brown fox jumps over the little lazy brown
dog".

To achieve this, I implemented my own version of
RemoveDuplicatesTokenFilter with an overridden process() method as:

  protected Token process(Token t) throws IOException {
    Token nextTok = peek(1);
    if (t != null && nextTok != null) {
      if (t.termText().equalsIgnoreCase(nextTok.termText())) {
        return null;
      }
    }
    return t;
  }

The above implementation works as desired and the consecutive duplicates
are getting removed :)

Any advice/feedback on the above implementation :)

Regards
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Generic-RemoveDuplicatesTokenFilter-tp3581656p3581656.html
Sent from the Solr - User mailing list archive at Nabble.com.
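For reference, to use a filter like this from schema.xml in 3.x it still needs
a factory class; a minimal sketch (the class names are hypothetical, with the
filter itself holding the process() logic above):

  package com.mycompany.analysis;

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.solr.analysis.BaseTokenFilterFactory;

  // Hypothetical factory, referenced from schema.xml as
  // <filter class="com.mycompany.analysis.ConsecutiveDuplicatesFilterFactory"/>
  public class ConsecutiveDuplicatesFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
      return new ConsecutiveDuplicatesFilter(input);
    }
  }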


Re: NRT or similar for Solr 3.5?

2011-12-12 Thread vikram kamath
@Steven .. try some alternate email address(besides google/yahoo)  and
check your spam


Regards
Vikram Kamath



2011/12/13 Steven Ou steve...@gmail.com

 Yeah, running Chrome on OSX and doesn't do anything.

 Just switched to Firefox and it works. *But*, also don't seem to be
 receiving confirmation email.
 --
 Steven Ou | 歐偉凡

 *ravn.com* | Chief Technology Officer
 steve...@gmail.com | +1 909-569-9880


 2011/12/12 vikram kamath kmar...@gmail.com

  The Onclick handler does not seem to be called on google chrome (Ubuntu
 ).
 
  Also , I dont seem to receive the email with the confirmation link on
  registering (I have checked my spam)
 
 
 
 
  Regards
  Vikram Kamath
 
 
 
  2011/12/12 Nagendra Nagarajayya nnagaraja...@transaxtions.com
 
   Steven:
  
   There is an onclick handler that allows you to download the src. BTW,
 an
   early access Solr 3.5 with RankingAlgorithm 1.3 (NRT) release is
   available for download. So please give it a try.
  
   Regards,
  
   - Nagendra Nagarajayya
   http://solr-ra.tgels.org
   http://rankingalgorithm.tgels.org
  
  
   On 12/10/2011 11:18 PM, Steven Ou wrote:
All the links on the download section link to
  http://solr-ra.tgels.org/#
--
Steven Ou | 歐偉凡
   
*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880
   
   
2011/12/11 Nagendra Nagarajayya nnagaraja...@transaxtions.com
   
Steven:
   
Not sure why you had problems, #downloads (
http://solr-ra.tgels.org/#downloads ) should point you to the
  downloads
section showing the different versions available for download ?
 Please
share if this is not so ( there were downloads yesterday with no
   problems )
   
Regarding NRT, you can switch between RA and Lucene at query level
 or
  at
config level; in the current version with RA, NRT is in effect while
with lucene, it is not, you can get more information from here:
http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf
   
Solr 3.5 with RankingAlgorithm 1.3 should be available next week.
   
Regards,
   
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
   
On 12/9/2011 4:49 PM, Steven Ou wrote:
Hey Nagendra,
   
I took a look and Solr-RA looks promising - but:
   
   - I could not figure out how to download it. It seems like all
 the
   download links just point to #
   - I wasn't looking for another ranking algorithm, so would it be
   possible for me to use NRT but *not* RA (i.e. just use the
 normal
Lucene
   library)?
   
--
Steven Ou | 歐偉凡
   
*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880
   
   
On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya 
nnagaraja...@transaxtions.com wrote:
   
Steven:
   
Please take a look at Solr  with RankingAlgorithm. It offers NRT
functionality. You can set your autoCommit to about 15 mins. You
 can
   get
more information from here:
   
 http://solr-ra.tgels.com/wiki/**en/Near_Real_Time_Search_ver_**3.x
http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x
   
Regards,
   
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.**org 
   http://rankingalgorithm.tgels.org
   
   
On 12/8/2011 9:30 PM, Steven Ou wrote:
   
Hi guys,
   
I'm looking for NRT functionality or similar in Solr 3.5. Is that
possible?
From what I understand there's NRT in Solr 4, but I can't figure
  out
whether or not 3.5 can do it as well?
   
If not, is it feasible to use an autoCommit every 1000ms? We
 don't
currently process *that* much data so I wonder if it's OK to just
commit
very often? Obviously not scalable on a large scale, but it is
   feasible
for
a relatively small amount of data?
   
I recently upgraded from Solr 1.4 to 3.5. I had a hard time
 getting
everything working smoothly and the process ended up taking my
 site
down
for a couple hours. I am very hesitant to upgrade to Solr 4 if
 it's
   not
necessary to get some sort of NRT functionality.
   
Can anyone help me? Thanks!
--
Steven Ou | 歐偉凡
   
*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880