Re: NRT or similar for Solr 3.5?

2011-12-10 Thread Steven Ou
All the links on the download section link to http://solr-ra.tgels.org/#
--
Steven Ou | 歐偉凡

*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880


2011/12/11 Nagendra Nagarajayya 

> Steven:
>
> Not sure why you had problems; #downloads (
> http://solr-ra.tgels.org/#downloads ) should take you to the downloads
> section listing the different versions available for download. Please
> let me know if that's not the case (there were successful downloads yesterday).
>
> Regarding NRT: you can switch between RA and Lucene at the query level or at
> the config level; in the current version, NRT is in effect with RA but not
> with Lucene. You can get more information here:
> http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf
>
> Solr 3.5 with RankingAlgorithm 1.3 should be available next week.
>
> Regards,
>
> - Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> On 12/9/2011 4:49 PM, Steven Ou wrote:
> > Hey Nagendra,
> >
> > I took a look and Solr-RA looks promising - but:
> >
> >- I could not figure out how to download it. It seems like all the
> >download links just point to "#"
> >- I wasn't looking for another ranking algorithm, so would it be
> >possible for me to use NRT but *not* RA (i.e. just use the normal Lucene
> >library)?
> >
> > --
> > Steven Ou | 歐偉凡
> >
> > *ravn.com* | Chief Technology Officer
> > steve...@gmail.com | +1 909-569-9880
> >
> >
> > On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya <
> > nnagaraja...@transaxtions.com> wrote:
> >
> >> Steven:
> >>
> >> Please take a look at Solr  with RankingAlgorithm. It offers NRT
> >> functionality. You can set your autoCommit to about 15 mins. You can get
> >> more information from here:
> >> http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x
> >>
> >>
> >> Regards,
> >>
> >> - Nagendra Nagarajayya
> >> http://solr-ra.tgels.org
> >> http://rankingalgorithm.tgels.org
> >>
> >>
> >> On 12/8/2011 9:30 PM, Steven Ou wrote:
> >>
> >>> Hi guys,
> >>>
> >>> I'm looking for NRT functionality or similar in Solr 3.5. Is that
> >>> possible?
> >>> From what I understand there's NRT in Solr 4, but I can't figure out
> >>> whether or not 3.5 can do it as well?
> >>>
> >>> If not, is it feasible to use an autoCommit every 1000ms? We don't
> >>> currently process *that* much data so I wonder if it's OK to just commit
> >>>
> >>> very often? Obviously not scalable on a large scale, but it is feasible
> >>> for
> >>> a relatively small amount of data?
> >>>
> >>> I recently upgraded from Solr 1.4 to 3.5. I had a hard time getting
> >>> everything working smoothly and the process ended up taking my site down
> >>> for a couple hours. I am very hesitant to upgrade to Solr 4 if it's not
> >>> necessary to get some sort of NRT functionality.
> >>>
> >>> Can anyone help me? Thanks!
> >>> --
> >>> Steven Ou | 歐偉凡
> >>>
> >>> *ravn.com* | Chief Technology Officer
> >>> steve...@gmail.com | +1 909-569-9880
> >>>
> >>>
>
>


Re: RegexQuery performance

2011-12-10 Thread Erick Erickson
Hmmm, I don't know all that much about the universe
you're searching (I'm *really* sorry about that, but I
couldn't resist) but I wonder if you can't turn the problem
on its head and do your regex stuff at index time instead.

My off-the-top-of-my-head notion is that you implement a
Filter whose job is to emit some "special" tokens when
you find strings like this that allow you to search without
regexes. For instance, in the example you give, you could
index something like...oh... I don't know, ###VER### as
well as the "normal" text of "IRAS-A-FPA-3-RDR-IMPS-V6.0".
Now, when searching for docs with the pattern you used
as an example, you look for ###VER### instead. I guess
it all depends on how many regexes you need to allow.
This wouldn't work at all if you allow users to put in arbitrary
regexes, but if you have a small enough number of patterns
you'll allow, something like this could work.

The Filter I'm thinking of might behave something like a
SynonymFilter and emit multiple tokens at the same position.
You'd have to take some care that the *query* part of the
analyzer chain didn't undo whatever special symbols you used,
but that's all do-able.

I guess the idea here is that if you can map out all the kinds
of regex patterns you want to apply at query time and apply
them at index time instead it might work. Then you have to
work out how to allow the users to pick the special patterns,
but that's a UI problem...
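
For concreteness, the index-time injection Erick describes could be sketched like this. This is a language-agnostic sketch in Python, not a real Lucene TokenFilter (which would be written in Java and would emit the marker with a position increment of 0, the way SynonymFilter stacks tokens); the ###VER### marker and the version pattern are taken from the thread, everything else is hypothetical.

```python
import re

# Marker and pattern from the thread; in Solr this logic would live in a
# custom Lucene TokenFilter in the index-time analyzer chain.
VER_MARKER = "###VER###"
VERSION_ID = re.compile(r"[A-Z0-9:\-]+V\d+\.\d+")

def inject_markers(tokens):
    """Emit each token, plus a stacked marker token at the same position
    whenever the token looks like a versioned dataset identifier."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))             # the "normal" token
        if VERSION_ID.fullmatch(tok):
            out.append((pos, VER_MARKER))  # same position, like a synonym
    return out

print(inject_markers(["rvey", "IRAS-A-FPA-3-RDR-IMPS-V6.0", "NASA"]))
```

Queries for "documents containing any such identifier" then search for ###VER### instead of running a regex over every term; as noted above, the query-side analyzer must be configured so it does not mangle the marker.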

From a fortune cookie:
"A programmer had a problem that he tried to solve with
regular expressions. Now he has two problems" 

Best
Erick

On Sat, Dec 10, 2011 at 9:20 AM, Jay Luker  wrote:
> Hi Erick,
>
> On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson  
> wrote:
>> Could you show us some examples of the kinds of things
>> you're using regex for? I.e. the raw text and the regex you
>> use to match the example?
>
> Sure!
>
> An example identifier would be "IRAS-A-FPA-3-RDR-IMPS-V6.0", which
> identifies a particular Planetary Data System data set. Another
> example is "ULY-J-GWE-8-NULL-RESULTS-V1.0". These kinds of strings
> frequently appear in the references section of the articles, so the
> context looks something like,
>
> " ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System
> Tholen, D. J. 1989, in Asteroids II, ed ... "
>
> The simple & straightforward regex I've been using is
> /[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach but I
> haven't put my mind to it because I assumed the primary performance
> issue was elsewhere.
>
>> The reason I ask is that perhaps there are other approaches,
>> especially thinking about some clever analyzing at index time.
>>
>> For instance, perhaps NGrams are an option. Perhaps
>> just making WordDelimiterFilterFactory do its tricks. Perhaps.
>
> WordDelimiter does help in the sense that if you search for a specific
> identifier you will usually find fairly accurate results, even for
> cases where the hyphens resulted in the term being broken up. But I'm
> not sure how WordDelimiter can help if I want to search for a pattern.
>
> I tried a few tweaks to the index, like putting a minimum character
> count for terms, making sure WordDelimiter's preserveOriginal is
> turned on, indexing without lowercasing so that I don't have to use
> Pattern.CASE_INSENSITIVE. Performance was not improved significantly.
>
> The new RegexpQuery mentioned by R. Muir looks promising, but I
> haven't built an instance of trunk yet to try it out. Any other
> suggestions appreciated.
>
> Thanks!
> --jay
>
>
>> In other words, this could be an "XY problem"
>>
>> Best
>> Erick
>>
>> On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir  wrote:
>>> On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker  wrote:
 Hi,

 I am trying to provide a means to search our corpus of nearly 2
 million fulltext astronomy and physics articles using regular
 expressions. A small percentage of our users need to be able to
 locate, for example, certain types of identifiers that are present
 within the fulltext (grant numbers, dataset identifiers, etc.).

 My straightforward attempts to do this using RegexQuery have been
 successful only in the sense that I get the results I'm looking for.
 The performance, however, is pretty terrible, with most queries taking
 five minutes or longer. Is this the performance I should expect
 considering the size of my index and the massive number of terms? Are
 there any alternative approaches I could try?

 Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
 min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

 Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
 using a RangeQuery based on document id

 Any suggestions appreciated.

>>>
>>> This RegexQuery is not really scalable in my opinion, its always
>>> linear to the number of terms except in super-rare circumstances where
>>> it can compute a "common prefix" (and slow to boot).
>>>
>>> You can try svn trunk's RegexpQuery <-- don't forget the "p", instead
>>> from lucene core (it works from queryparser: /[ab]foo/, myfield:/bar/
>>> etc)
>>>
>>> The performance is faster, but keep in mind its only as good as the
>>> regular expressions, if the regular expressions are like /.*foo.*/,
>>> then its just as slow as wildcard of *foo*.

Re: VelocityResponseWriter's future

2011-12-10 Thread Paul Libbrecht



On Dec 10, 2011, at 02:56, Erik Hatcher wrote:
>> It's fast and easy but its testing ability is simply... unpredictable.
> 
> I'm not sure I get what you mean by the testability though.  Could you 
> clarify?   Taken a bit literally with the VRW, there's this in a test case:

I mean being sure whether the "boo!" in the snippet below is going to be
output, i.e. whether

#if(blablabla)
  boo!
#end

is going to be executed.
Sorry for the confusion: no (unit- or integration-) testing meant here.

paul

Re: NRT or similar for Solr 3.5?

2011-12-10 Thread Nagendra Nagarajayya
Steven:

Not sure why you had problems; #downloads (
http://solr-ra.tgels.org/#downloads ) should take you to the downloads
section listing the different versions available for download. Please
let me know if that's not the case (there were successful downloads yesterday).

Regarding NRT: you can switch between RA and Lucene at the query level or at
the config level; in the current version, NRT is in effect with RA but not
with Lucene. You can get more information here:
http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf

Solr 3.5 with RankingAlgorithm 1.3 should be available next week.

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 12/9/2011 4:49 PM, Steven Ou wrote:
> Hey Nagendra,
>
> I took a look and Solr-RA looks promising - but:
>
>- I could not figure out how to download it. It seems like all the
>download links just point to "#"
>- I wasn't looking for another ranking algorithm, so would it be
>possible for me to use NRT but *not* RA (i.e. just use the normal Lucene
>library)?
>
> --
> Steven Ou | 歐偉凡
>
> *ravn.com* | Chief Technology Officer
> steve...@gmail.com | +1 909-569-9880
>
>
> On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya <
> nnagaraja...@transaxtions.com> wrote:
>
>> Steven:
>>
>> Please take a look at Solr  with RankingAlgorithm. It offers NRT
>> functionality. You can set your autoCommit to about 15 mins. You can get
>> more information from here:
>> http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x
>>
>>
>> Regards,
>>
>> - Nagendra Nagarajayya
>> http://solr-ra.tgels.org
>> http://rankingalgorithm.tgels.org
>>
>>
>> On 12/8/2011 9:30 PM, Steven Ou wrote:
>>
>>> Hi guys,
>>>
>>> I'm looking for NRT functionality or similar in Solr 3.5. Is that
>>> possible?
>>> From what I understand there's NRT in Solr 4, but I can't figure out
>>> whether or not 3.5 can do it as well?
>>>
>>> If not, is it feasible to use an autoCommit every 1000ms? We don't
>>> currently process *that* much data so I wonder if it's OK to just commit
>>>
>>> very often? Obviously not scalable on a large scale, but it is feasible
>>> for
>>> a relatively small amount of data?
>>>
>>> I recently upgraded from Solr 1.4 to 3.5. I had a hard time getting
>>> everything working smoothly and the process ended up taking my site down
>>> for a couple hours. I am very hesitant to upgrade to Solr 4 if it's not
>>> necessary to get some sort of NRT functionality.
>>>
>>> Can anyone help me? Thanks!
>>> --
>>> Steven Ou | 歐偉凡
>>>
>>> *ravn.com* | Chief Technology Officer
>>> steve...@gmail.com | +1 909-569-9880
>>>
>>>
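
For reference, the autoCommit interval under discussion lives in solrconfig.xml's update handler. A sketch, assuming the stock 3.x element names, with the two values mentioned in the thread (the 1000 ms from Steven's question vs. the ~15 minutes Nagendra suggests), shown for illustration rather than as a recommendation:

```xml
<!-- solrconfig.xml (Solr 3.x) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit at most this often: 1000 ms as asked about in the thread,
         or roughly 900000 ms for the suggested ~15 minutes -->
    <maxTime>1000</maxTime>
    <!-- optionally also commit once this many docs are pending -->
    <maxDocs>10000</maxDocs>
  </autoCommit>
</updateHandler>
```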



Re: [jira] [Commented] (SOLR-2961) DIH with threads and TikaEntityProcessor JDBC Issue

2011-12-10 Thread Mikhail Khludnev
On Sat, Dec 10, 2011 at 11:58 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello David,
>
> I know about the DIH threading problems. Some time ago I made a quick-fix
> patch for 3.4, which passes the tests. If you have some time, please try it.
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201110.mbox/%3CCANGii8cOrWXsSvP9EYcRFX_mQBoVdatzRW%2BF0Cq2c%3D6sx8czZw%40mail.gmail.com%3E
> I'm working on fixing it in trunk.
>
> But I've never seen that ClassCastException; it may be a different bug.
>
> Regards
>
>
> On Sat, Dec 10, 2011 at 10:35 PM, David Webb (Commented) (JIRA) <
> j...@apache.org> wrote:
>
>>
>>[
>> https://issues.apache.org/jira/browse/SOLR-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166926#comment-13166926]
>>
>> David Webb commented on SOLR-2961:
>> --
>>
>> Oddly, when threads="2", processing continues even though the
>> stack traces are output to the logs.  When threads="6", the DIH process
>> immediately stops when the error occurs and performs a rollback.
>>
>> This is preventing me from using DIH to load and maintain my production
>> index.  Any help is greatly appreciated since I am now at the 11th hour. :)
>>
>> Solr and all components have been stellar up to this point. Great project!
>>
>> > DIH with threads and TikaEntityProcessor JDBC Issue
>> > ---
>> >
>> > Key: SOLR-2961
>> > URL: https://issues.apache.org/jira/browse/SOLR-2961
>> > Project: Solr
>> >  Issue Type: Bug
>> >  Components: contrib - DataImportHandler
>> >Affects Versions: 3.4, 3.5
>> > Environment: Windows Server 2008, Apache Tomcat 6, Oracle 11g,
>> ojdbc 11.2.0.1
>> >Reporter: David Webb
>> >  Labels: dih, tika
>> > Attachments: data-config.xml
>> >
>> >
>> > I have a DIH Configuration that works great when I don't specify
>> threads="X" in the root entity.  As soon as I give a value for threads, I
>> get the following error messages in the stacktrace.  Please advise.
>> > SEVERE: JdbcDataSource was not closed prior to finalize(), indicates a
>> bug -- POSSIBLE RESOURCE LEAK!!!
>> > Dec 10, 2011 1:18:33 PM
>> org.apache.solr.handler.dataimport.JdbcDataSource closeConnection
>> > SEVERE: Ignoring Error when closing connection
>> > java.sql.SQLRecoverableException: IO Error: Socket closed
>> >   at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:511)
>> >   at
>> oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:3931)
>> >   at
>> org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:401)
>> >   at
>> org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:392)
>> >   at
>> org.apache.solr.handler.dataimport.JdbcDataSource.finalize(JdbcDataSource.java:380)
>> >   at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
>> >   at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
>> >   at java.lang.ref.Finalizer.access$100(Unknown Source)
>> >   at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
>> > Caused by: java.net.SocketException: Socket closed
>> >   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>> >   at java.net.SocketOutputStream.write(Unknown Source)
>> >   at oracle.net.ns.DataPacket.send(DataPacket.java:199)
>> >   at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:211)
>> >   at
>> oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:227)
>> >   at oracle.net.ns.NetInputStream.read(NetInputStream.java:175)
>> >   at oracle.net.ns.NetInputStream.read(NetInputStream.java:100)
>> >   at oracle.net.ns.NetInputStream.read(NetInputStream.java:85)
>> >   at
>> oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
>> >   at
>> oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
>> >   at
>> oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1122)
>> >   at
>> oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1099)
>> >   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:288)
>> >   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
>> >   at
>> oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)
>> >   at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:498)
>> >   ... 8 more
>> > Dec 10, 2011 1:18:34 PM
>> org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
>> > SEVERE: Exception in entity : null
>> > org.apache.solr.handler.dataimport.DataImportHandlerException: Failed
>> to initialize DataSource: f2
>> >   at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>> >   at
>> org.apache.so

Re: Problems with SolrUIMA

2011-12-10 Thread Adriana Farina
Hello Tommaso,

Thank you for your answer. 

Yes, I'm using Solr 3.4.0: so what's the right Solr-UIMA module version to use? 
Or do I have to use an earlier version of Solr?

Just another question. From what I understood, the Solr-UIMA component should 
work in the following way: when I submit a set of documents to Solr, it sends 
them to the UIMA pipeline, then takes the annotated documents and builds the 
Lucene index that I can search through the Solr webapp. Is that right, or is 
there something I'm missing?

Thank you again.



Sent from my iPhone

On Dec 10, 2011, at 14:58, Tommaso Teofili wrote:

> Hello Adriana,
> 
> your configuration looks fine to me.
> The exception you pasted makes me think you're running a Solr instance at
> one version (3.4.0) while the Solr-UIMA module jar is at a different
> version; I remember there was a change to the
> UpdateRequestProcessorFactory API at some point, which may be causing
> this issue.
> 
> Tommaso
> 
> 
> 
> 2011/12/7 Adriana Farina 
> 
>> Hello,
>> 
>> 
>> I'm trying to use the SolrUIMA component of solr 3.4.0. I modified
>> solrconfig.xml file in the following way:
>> 
>> 
>> <updateRequestProcessorChain name="uima">
>>   <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
>>     <lst name="uimaConfig">
>>       <lst name="runtimeParameters"/>
>>       <str name="analysisEngine">C:\Users\Stefano\workspace2\UimaComplete\descriptors\analysis_engine\AggregateAE.xml</str>
>>       <bool name="ignoreErrors">true</bool>
>>       <lst name="analyzeFields">
>>         <bool name="merge">false</bool>
>>         <arr name="fields">
>>           <str>text</str>
>>         </arr>
>>       </lst>
>>       <lst name="fieldMappings">
>>         <lst name="type">
>>           <str name="name">org.apache.uima.RegEx</str>
>>           <lst name="mapping">
>>             <str name="feature">expressionFound</str>
>>             <str name="field">campo1</str>
>>           </lst>
>>         </lst>
>>         <lst name="type">
>>           <str name="name">org.apache.uima.LingAnnotator</str>
>>           <lst name="mapping">
>>             <str name="feature">category</str>
>>             <str name="field">campo2</str>
>>           </lst>
>>           <lst name="mapping">
>>             <str name="feature">precision</str>
>>             <str name="field">campo3</str>
>>           </lst>
>>         </lst>
>>         <lst name="type">
>>           <str name="name">org.apache.uima.DictionaryEntry</str>
>>           <lst name="mapping">
>>             <str name="feature">coveredText</str>
>>             <str name="field">campo4</str>
>>           </lst>
>>         </lst>
>>       </lst>
>>     </lst>
>>   </processor>
>>   <processor class="solr.LogUpdateProcessorFactory"/>
>>   <processor class="solr.RunUpdateProcessorFactory"/>
>> </updateRequestProcessorChain>
>> 
>> 
>> I followed the tutorial I found on the solr wiki (
>> http://wiki.apache.org/solr/SolrUIMA) and customized it. However when I
>> start the solr webapp (java -jar start.jar) I get the following exception:
>> 
>> org.apache.solr.common.SolrException: Error Instantiating
>> 
>> UpdateRequestProcessorFactory,
>> org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory is not a
>> org.apache.solr.update.processor.UpdateRequestProcessorFactory
>>   at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:425)
>>   at
>> org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:445)
>>   at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1569)
>>   at org.apache.solr.update.processor.UpdateRequestProcessorChain.init
>> (UpdateRequestProcessorChain.java:57)
>>   at
>> org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:447)
>>   at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1553)
>>   at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1547)
>>   at
>> org.apache.solr.core.SolrCore.loadUpdateProcessorChains(SolrCore.java:620)
>>   at org.apache.solr.core.SolrCore.(SolrCore.java:561)
>>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:463)
>>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
>>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
>>   at
>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.
>> java:130)
>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:
>> 94)
>>   at
>> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>>   at
>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>   at
>> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:
>> 713)
>>   at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
>>   at
>> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:
>> 1282)
>>   at
>> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>>   at
>> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>>   at
>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>   at
>> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:
>> 152)
>>   at org.mortbay.jetty.handler.ContextHandlerCollection.doStart
>> (ContextHandlerCollection.java:156)
>>   at
>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>   at
>> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:
>> 152)
>>   at
>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>   at
>> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>>   at org.mortbay.jetty.Server.doStart(Server.java:224)
>>   at
>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>>   at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>   at java.lang.reflect.Method.invoke(Unknown Source)
>>   at org.mortbay.start.Main.invokeMain(Main.java:194)
>>   at org.mortbay.start.Main.start(Main.java:534)
>>   at org.mortbay.start.Main.start(Main.java:441)
>>   at org.mortbay.start.Main.main(Main.java:119)

Re: RegexQuery performance

2011-12-10 Thread Jay Luker
Hi Erick,

On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson  wrote:
> Could you show us some examples of the kinds of things
> you're using regex for? I.e. the raw text and the regex you
> use to match the example?

Sure!

An example identifier would be "IRAS-A-FPA-3-RDR-IMPS-V6.0", which
identifies a particular Planetary Data System data set. Another
example is "ULY-J-GWE-8-NULL-RESULTS-V1.0". These kinds of strings
frequently appear in the references section of the articles, so the
context looks something like,

" ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System
Tholen, D. J. 1989, in Asteroids II, ed ... "

The simple & straightforward regex I've been using is
/[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach but I
haven't put my mind to it because I assumed the primary performance
issue was elsewhere.
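
As a sanity check, the pattern does pull the identifier out of that context. A quick sketch outside Solr with Python's `re` (the pattern and sample text are from this message; everything else is illustrative):

```python
import re

# Jay's pattern: a run of uppercase letters, digits, colons, and hyphens
# ending in a V<major>.<minor> version suffix.
pattern = re.compile(r"[A-Z0-9:\-]+V\d+\.\d+")

context = (" ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System "
           "Tholen, D. J. 1989, in Asteroids II, ed ... ")

print(pattern.findall(context))  # ['IRAS-A-FPA-3-RDR-IMPS-V6.0']
```

Note the pattern assumes uppercase input, which is why indexing without lowercasing (or using a case-insensitive match) matters here.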

> The reason I ask is that perhaps there are other approaches,
> especially thinking about some clever analyzing at index time.
>
> For instance, perhaps NGrams are an option. Perhaps
> just making WordDelimiterFilterFactory do its tricks. Perhaps.

WordDelimiter does help in the sense that if you search for a specific
identifier you will usually find fairly accurate results, even for
cases where the hyphens resulted in the term being broken up. But I'm
not sure how WordDelimiter can help if I want to search for a pattern.

I tried a few tweaks to the index, like putting a minimum character
count for terms, making sure WordDelimiter's preserveOriginal is
turned on, indexing without lowercasing so that I don't have to use
Pattern.CASE_INSENSITIVE. Performance was not improved significantly.

The new RegexpQuery mentioned by R. Muir looks promising, but I
haven't built an instance of trunk yet to try it out. Any other
suggestions appreciated.

Thanks!
--jay


> In other words, this could be an "XY problem"
>
> Best
> Erick
>
> On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir  wrote:
>> On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker  wrote:
>>> Hi,
>>>
>>> I am trying to provide a means to search our corpus of nearly 2
>>> million fulltext astronomy and physics articles using regular
>>> expressions. A small percentage of our users need to be able to
>>> locate, for example, certain types of identifiers that are present
>>> within the fulltext (grant numbers, dataset identifiers, etc.).
>>>
>>> My straightforward attempts to do this using RegexQuery have been
>>> successful only in the sense that I get the results I'm looking for.
>>> The performance, however, is pretty terrible, with most queries taking
>>> five minutes or longer. Is this the performance I should expect
>>> considering the size of my index and the massive number of terms? Are
>>> there any alternative approaches I could try?
>>>
>>> Things I've already tried:
>>>  * reducing the sheer number of terms by adding a LengthFilter,
>>> min=6, to my index analysis chain
>>>  * swapping in the JakartaRegexpCapabilities
>>>
>>> Things I intend to try if no one has any better suggestions:
>>>  * chunk up the index and search concurrently, either by sharding or
>>> using a RangeQuery based on document id
>>>
>>> Any suggestions appreciated.
>>>
>>
>> This RegexQuery is not really scalable in my opinion, its always
>> linear to the number of terms except in super-rare circumstances where
>> it can compute a "common prefix" (and slow to boot).
>>
>> You can try svn trunk's RegexpQuery <-- don't forget the "p", instead
>> from lucene core (it works from queryparser: /[ab]foo/, myfield:/bar/
>> etc)
>>
>> The performance is faster, but keep in mind its only as good as the
>> regular expressions, if the regular expressions are like /.*foo.*/,
>> then
>> its just as slow as wildcard of *foo*.
>>
>> --
>> lucidimagination.com
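
Robert's "linear to the number of terms" point is easy to see with a toy sorted term dictionary: a pattern with a usable prefix can seek straight to its contiguous range, while something like /.*foo.*/ has to test every term. A sketch (hypothetical terms; nothing like Lucene's actual term index):

```python
import bisect
import re

# Toy sorted "term dictionary" standing in for Lucene's term index.
terms = sorted(["apple", "bar", "foo1", "foobar", "food", "football",
                "myfoo2", "seafood", "zoo"])

def prefix_scan(prefix):
    """Seek to the prefix by binary search, then read the contiguous run."""
    i = bisect.bisect_left(terms, prefix)
    hits, examined = [], 0
    while i < len(terms) and terms[i].startswith(prefix):
        hits.append(terms[i]); examined += 1; i += 1
    return hits, examined

def regex_scan(pattern):
    """No usable prefix: every term in the dictionary must be tested."""
    rx = re.compile(pattern)
    hits = [t for t in terms if rx.fullmatch(t)]
    return hits, len(terms)

print(prefix_scan("foo"))      # examines only the terms starting with "foo"
print(regex_scan(r".*foo.*"))  # examines every term, like wildcard *foo*
```

This is the intuition behind why the trunk RegexpQuery helps only when the expression yields a seekable prefix.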


Re: Problems with SolrUIMA

2011-12-10 Thread Tommaso Teofili
Hello Adriana,

your configuration looks fine to me.
The exception you pasted makes me think you're running a Solr instance at
one version (3.4.0) while the Solr-UIMA module jar is at a different
version; I remember there was a change to the
UpdateRequestProcessorFactory API at some point, which may be causing
this issue.

Tommaso



2011/12/7 Adriana Farina 

> Hello,
>
>
> I'm trying to use the SolrUIMA component of solr 3.4.0. I modified
> solrconfig.xml file in the following way:
>
> 
> <updateRequestProcessorChain name="uima">
>   <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
>     <lst name="uimaConfig">
>       <lst name="runtimeParameters"/>
>       <str name="analysisEngine">C:\Users\Stefano\workspace2\UimaComplete\descriptors\analysis_engine\AggregateAE.xml</str>
>       <bool name="ignoreErrors">true</bool>
>       <lst name="analyzeFields">
>         <bool name="merge">false</bool>
>         <arr name="fields">
>           <str>text</str>
>         </arr>
>       </lst>
>       <lst name="fieldMappings">
>         <lst name="type">
>           <str name="name">org.apache.uima.RegEx</str>
>           <lst name="mapping">
>             <str name="feature">expressionFound</str>
>             <str name="field">campo1</str>
>           </lst>
>         </lst>
>         <lst name="type">
>           <str name="name">org.apache.uima.LingAnnotator</str>
>           <lst name="mapping">
>             <str name="feature">category</str>
>             <str name="field">campo2</str>
>           </lst>
>           <lst name="mapping">
>             <str name="feature">precision</str>
>             <str name="field">campo3</str>
>           </lst>
>         </lst>
>         <lst name="type">
>           <str name="name">org.apache.uima.DictionaryEntry</str>
>           <lst name="mapping">
>             <str name="feature">coveredText</str>
>             <str name="field">campo4</str>
>           </lst>
>         </lst>
>       </lst>
>     </lst>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory"/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
>
>
> I followed the tutorial I found on the solr wiki (
> http://wiki.apache.org/solr/SolrUIMA) and customized it. However when I
> start the solr webapp (java -jar start.jar) I get the following exception:
>
> org.apache.solr.common.SolrException: Error Instantiating
>
> UpdateRequestProcessorFactory,
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory is not a
> org.apache.solr.update.processor.UpdateRequestProcessorFactory
>at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:425)
>at
> org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:445)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1569)
>at org.apache.solr.update.processor.UpdateRequestProcessorChain.init
> (UpdateRequestProcessorChain.java:57)
>at
> org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:447)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1553)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1547)
>at
> org.apache.solr.core.SolrCore.loadUpdateProcessorChains(SolrCore.java:620)
>at org.apache.solr.core.SolrCore.(SolrCore.java:561)
>at org.apache.solr.core.CoreContainer.create(CoreContainer.java:463)
>at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
>at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.
> java:130)
>at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:
> 94)
>at
> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:
> 713)
>at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
>at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:
> 1282)
>at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:
> 152)
>at org.mortbay.jetty.handler.ContextHandlerCollection.doStart
> (ContextHandlerCollection.java:156)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:
> 152)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>at org.mortbay.jetty.Server.doStart(Server.java:224)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>at java.lang.reflect.Method.invoke(Unknown Source)
>at org.mortbay.start.Main.invokeMain(Main.java:194)
>at org.mortbay.start.Main.start(Main.java:534)
>at org.mortbay.start.Main.start(Main.java:441)
>at org.mortbay.start.Main.main(Main.java:119)
>
> I can't figure out why I'

Return message from XmlUpdateRequestHandler

2011-12-10 Thread O. Klein
Is there a way to get feedback from XML update messages like:

http://localhost:8983/solr/update?commit=true&stream.body=%3Cdelete%3E%3Cquery%3Eoffice:Bridgewater%3C/query%3E%3C/delete%3E

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Return-message-from-XmlUpdateRequestHandler-tp3575400p3575400.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: VelocityResponseWriter's future

2011-12-10 Thread Erik Hatcher
Oh gosh... there's no "argument" at all.  I dig Velocity (and other things), 
and I'm cool with folks liking it or not (and you do like it, it seems).  

I'm not sure I get what you mean by the testability though.  Could you clarify? 
  Taken a bit literally with the VRW, there's this in a test case:

SolrQueryRequest req = req("v.template","custom", 
"v.template.custom","$response.response.response_data");
//...
rsp.add("response_data", "testing");
vrw.write(buf, req, rsp);
assertEquals("testing", buf.toString());

Pretty predictable in this context, but I think you mean something different 
than JUnit testing like this.

Regarding the documentation comment I made, I was just filling in what I think 
is needed to make it better, and I like your wiki page tutorial idea.  But if 
/browse doesn't work literally out of the box when copying the example 
configuration files (and as I said in a previous e-mail, neither does Solr 
Cell, etc.) then we don't really have "out-of-the-box"-ness: it requires moving 
JARs or adjusting solrconfig.xml to make these things work.

Erik


On Dec 9, 2011, at 18:01 , Paul Libbrecht wrote:

> Erik,
> 
> don't argue with me about Velocity, I'm using it several hours a day in XWiki.
> It's fast and easy but its testing ability is simply... unpredictable.
> 
> I did not mean to say it is not documented enough but that it could be 
> reformulated as a tutorial wiki page instead of an example software.
> 
> paul
> 
> On Dec 9, 2011, at 23:17, Erik Hatcher wrote:
> 
>> s/choice templating languages/template language choices/
>> 
>> Also, meant to include
>> * http://today.java.net/pub/a/today/2003/12/16/velocity.html
>> 
>> On Dec 9, 2011, at 17:07 , Erik Hatcher wrote:
>> 
>>> Paul -
>>> 
>>> Thanks for your feedback.
>>> 
>>> As for JSP... the problem with JSP's is that they must be inside the .war 
>>> file and that is prohibitive for the flexibility of adjusting the vm files 
>>> to "create links to the right resource" easily.  Certainly choice 
>>> templating languages are an opinionated kind of thing, and quite obviously 
>>> I prefer Velocity templating* over pretty much any alternative.  Angle 
>>> brackets are meant for HTML, and mixing JSP and HTML is not very clean to 
>>> me.  And I've built full-featured browse.jsp and browse.php examples 
>>> in past lives too :)
>>> 
>>> Regarding it being an example... it's wired into Solr under example/ as-is. 
>>>  Unfortunately, yet understandably, that example gets copied by many to 
>>> start new projects and then the UI needs adjustments to be in line with 
>>> different data (as does the schema and solrconfig, but many folks don't 
>>> adjust those either).  Point taken that it certainly could be 
>>> implemented/documented better though.
>>> 
>>> Erik
>>> 
>>> 
>>> On Dec 9, 2011, at 16:38 , Paul Libbrecht wrote:
>>> 
 Erik,
 
 The VelocityResponseWriter solved a need of mine: provide an interface 
 that shows off a good amount of Solr's capability, with queries close to 
 what a developer writes and a UI that you can mail to colleagues.
 
 The out-of-the-box-ness is crucial here.
 Adjusting the vm files was also crucial (e.g. to create links to the right 
 resource).
 
 The VelocityResponseWriter also has a big advantage: it is a very small 
 codebase, so it is easy to adapt.
 
 How about making it an example or tutorial?
 
 paul
 
 PS: I'll note that I would prefer a "candid" jsp equivalent (I still do) 
 but it was never available (one day I'll make one).
 
 
 Le 9 déc. 2011 à 22:30, Erik Hatcher a écrit :
 
> So I thought that Solr having a decent HTML search UI out of the box was 
> a good idea.  I still do.  But it's been a bit of a pain to maintain 
> (originally it was a contrib module, then core, then folks didn't want it 
> as a core dependency, and now it is back as a contrib), and the UI has 
> accumulated a fair bit of cruft/ugliness as folks have tacked "the 
> kitchen sink" onto it, compared to my idealistic generic (not specific to 
> the example data) lean and clean sensibilities.
> 
> What should be done?  Who actually cares about VRW or the /browse 
> interface?  And if you do care, what do you like or dislike about it?  
> And if you really really care, patches welcome! ;)
> 
> Perhaps, as I'm starting to feel in general about open source pet 
> projects, add-ons, and "monkey patches" to open source software, it should 
> be moved out of Solr's repo altogether and maintained elsewhere (say my 
> personal or Lucid's github).
> 
> I appreciate your candid thoughts on this.
> 
>   Erik
> 
 
>>> 
>> 
> 



Re: DIH full import and clean

2011-12-10 Thread O. Klein
I get the behaviour I want when I delete all docs with an XML update and then
do a full-import with clean=false, which does run both root entities. I
never got preImportDeleteQuery to work with clean=false, nor did DIH run both
root entities with clean=true.
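For what it's worth, my understanding is that preImportDeleteQuery only takes effect when clean=true, where it replaces the default *:* delete; with clean=false no delete is issued at all. A data-config.xml sketch showing where the attribute goes (entity names, queries, and the datasource are illustrative):

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb"/>
  <document>
    <!-- preImportDeleteQuery narrows what clean=true deletes before the import -->
    <entity name="items" pk="id"
            preImportDeleteQuery="doctype:item"
            query="SELECT id, name FROM items">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```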




O. Klein wrote
> 
> Can someone explain to me why, when I run a full import with clean on, it
> only runs the last entity, while with clean off I get the behaviour I want
> (runs both entities)?
> 
> I thought clean was only to clear the index before running.
> 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-full-import-and-clean-tp3574065p3575270.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Arabic support,

2011-12-10 Thread Gora Mohanty
On Sat, Dec 10, 2011 at 5:50 PM, E P  wrote:
> Hi,
>
> How can I add Arabic support to Solr?

What exactly do you mean by "arabic support"?
If you mean that you want to index Arabic text,
Solr supports Unicode, and you should be able
to do that transparently, as long as the rest of
your system handles Unicode properly.
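If the goal is language-aware analysis rather than just storing the text, Solr 3.x also ships Arabic analysis components that can be wired into a schema.xml field type. A sketch (the field name and the stopwords setup are illustrative):

```xml
<!-- An Arabic-aware field type: orthographic normalization plus light stemming -->
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds orthographic variants (e.g. alef forms) before stemming -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- A hypothetical field using the type above -->
<field name="body_ar" type="text_ar" indexed="true" stored="true"/>
```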

Regards,
Gora


Re: schema,

2011-12-10 Thread Gora Mohanty
On Sat, Dec 10, 2011 at 5:49 PM, E P  wrote:
> Hi
> Does anybody know a good source for schema.xml composition?
> I am new to solr :)

The Wiki contains a lot of information on Solr schemas. You
could start with:
* http://wiki.apache.org/solr/SchemaXml
* http://wiki.apache.org/solr/SchemaDesign
Searching Google for "Solr schema" turns up these, and other
links. Finally, the Solr distribution includes example schema.xml
files under the example/ directory.
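As a minimal illustration of what a schema.xml ties together (a sketch only, not a drop-in file; the example schema in the distribution is far more complete):

```xml
<schema name="minimal" version="1.4">
  <types>
    <!-- Untokenized, exact-match strings: good for IDs and facets -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <!-- Tokenized, lowercased free text -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>title</defaultSearchField>
</schema>
```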

Regards,
Gora


Arabic support,

2011-12-10 Thread E P

Hi,

How can I add Arabic support to Solr?

best


schema,

2011-12-10 Thread E P

Hi
Does anybody know a good source for schema.xml composition?
I am new to solr :)

Best



Re: Virtual Memory very high

2011-12-10 Thread Yury Kats
On 12/9/2011 11:54 PM, Rohit wrote:
> Hi All,
> 
>  
> 
> I don't know if this question is directly related to this forum; I am running
> Solr in Tomcat on a Linux server. The moment I start Tomcat, the virtual memory
> shown by the top command goes to its maximum of 31.1G and then remains there.
> 
>  
> 
> Is this the right behaviour? Why is the virtual memory usage so high? I have
> 36GB of RAM on the server.

To limit VIRT memory, change the DirectoryFactory in solrconfig.xml to
solr.NIOFSDirectoryFactory.
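High VIRT on a 64-bit JVM usually just reflects memory-mapped index files plus the address space the JVM reserves, and is typically harmless. If it must be kept low, the change looks like this (a sketch; the directoryFactory element is available in solrconfig.xml from Solr 1.4 onward):

```xml
<!-- In solrconfig.xml: NIOFSDirectory uses read()-based I/O instead of mmap,
     so index files are not mapped into the process address space (lower VIRT) -->
<directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>
```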