Re: AW: AW: Lucas

Erik Fäßler Thu, 28 Aug 2014 07:18:12 -0700

Hi all,

thanks for the hint. As far as I could see by a quick glance into the docs, it 
integrates UIMA AEs as Lucene TokenStream components. This is a good idea for a 
more sophisticated NLP analysis in my opinion. However, I think it is not 
suited for large, scaled-out applications. If your UIMA pipeline needs serious 
processing power, the Lucene pipeline is a bottleneck because you either write 
directly into a Lucene index (thus single threaded) or into Solr (threads equal 
to the number of Solr instances). You could, of course, also write multiple 
Lucene indexes in parallel which would require a merging-post-processing step.
Either way, I feel you would give up on the scaling capabilities of UIMA. 
Perhaps you could use UIMA AS from the inside of a Lucene analysis component 
but it seems overly complicated to me.
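
To make the parallel-index option concrete, here is a minimal Python sketch. It is not Lucene or UIMA code: each worker builds a small in-memory inverted index over its own shard of documents, and a post-processing step merges the partial indexes. The function names (build_partial_index, merge_indexes, index_in_parallel) and the toy whitespace tokenizer are assumptions made purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_partial_index(shard):
    """Build an inverted index (term -> set of doc ids) for one shard."""
    index = defaultdict(set)
    for doc_id, text in shard:
        # Toy tokenizer; stands in for the real (expensive) analysis.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Post-processing step: union the posting sets of all partial indexes."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


def index_in_parallel(docs, workers=2):
    """Split docs round-robin into shards and index each shard in a worker."""
    shards = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(build_partial_index, shards))
    return merge_indexes(partials)


docs = [
    (1, "Barack Obama entered the White House"),
    (2, "The house was white"),
    (3, "Obama gave a speech"),
]
index = index_in_parallel(docs)
print(sorted(index["obama"]))
```

The merge is cheap here because the indexes are plain dictionaries. With real Lucene indexes the merge is a separate step of its own, which is the extra cost I mean above.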


With LuCas I can use the full power of UIMA and just send the results into a 
Lucene index or Solr. All the big processing stuff can be done with UIMA AS or, 
in my case, with an arbitrarily large set of synchronized CPEs.

My UIMA pipeline runs two and a half days on 10 (older) server machines for the 
complete set of documents I am working with. I wouldn't want to squeeze this 
into a Lucene analyzer.

Again, correct me if I'm mistaken and my points aren't valid. I'd be glad 
to learn about alternatives, if only to be properly informed.

Best,

Erik

> On 28 Aug 2014, at 09:21, "Dr. Armin Wegner" <[email protected]> 
> wrote:
> 
> Hello Erik,
> 
> in Lucene 4.9 (maybe earlier), you can replace the Lucene analyzer
> with a UIMA pipeline. At least the docs say so. I don't know how good
> it is because I've never used it.
> 
> Cheers,
> Armin
> 
> 
>> On 8/26/14, Erik Fäßler <[email protected]> wrote:
>> Hi all,
>> 
>> actually, I don't use LuCas anymore to write a Lucene index but rather to
>> send the created documents to Solr or ElasticSearch. There are two reasons I
>> continue to use LuCas: Its field merging capabilities and the term cover
>> mechanics.
>> Regarding the field merging: I have a lot of machine learning components in
>> my pipeline, nothing I could do within a Lucene analyzer. So when I
>> recognize entities with an ML component in the text and each entity has an
>> ID, then please consider this example:
>> 
>> Barack Obama entered the White House.
>> 
>> Let's pretend we would require an ML system to recognize "White House" as
>> THE one White House and let's say we gave it the ID "entity1".
>> My goal is to be able to search for the ID in the same way I would do using
>> a synonym filter, thus finding a document by terms that originally were not
>> included in this document's text, AND be able to correctly highlight the
>> corresponding text snippet. So, when I search for "entity1" (e.g. because
>> the user wants to see documents dealing with the White House), I want to
>> find the above example document with the string "White House" highlighted.
>> LuCas can do this for me by aligning or merging the text TokenStream with
>> the entity TokenStream, just as it is done within the CAS itself.
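
The merging idea quoted above can be pictured with a small Python sketch. This is not the LuCas or Lucene API, only an illustration under simplifying assumptions: tokens are (term, start, end) tuples, and the entity ID is added as an extra token that carries the character offsets of the text it covers, so a hit on the ID can highlight the original span. The names tokenize, merge_streams and highlight are hypothetical.

```python
import re


def tokenize(text):
    """Return (term, start, end) for each word in the text."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def merge_streams(tokens, entities):
    """Merge entity IDs into the token stream.

    Each entity is (entity_id, start, end). The ID becomes an extra token
    with the entity's offsets, aligned with the text tokens it covers.
    """
    merged = list(tokens) + list(entities)
    # Sort by start offset; at equal start, the longer span (the entity) first.
    merged.sort(key=lambda t: (t[1], -t[2]))
    return merged


def highlight(text, merged, query):
    """Wrap every span whose term equals the query in square brackets."""
    spans = sorted(((s, e) for term, s, e in merged if term == query),
                   reverse=True)  # right to left keeps earlier offsets valid
    for s, e in spans:
        text = text[:s] + "[" + text[s:e] + "]" + text[e:]
    return text


text = "Barack Obama entered the White House."
start = text.index("White House")
entities = [("entity1", start, start + len("White House"))]
merged = merge_streams(tokenize(text), entities)
print(highlight(text, merged, "entity1"))
# prints: Barack Obama entered the [White House].
```

Searching for "entity1" finds a term that never occurred in the text, yet the highlighted snippet is the original "White House" span.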
>> 
>> If this functionality can be achieved without using LuCas, please tell me,
>> I'd be happy to switch to up-to-date maintained default-components. Until
>> now I am under the impression this cannot be done by another component.
>> 
>> The term cover mechanics allow me to easily distribute terms across document
>> fields in a predefined, possibly overlapping, set division, the set cover. I
>> use it to automatically deal with a lot of faceting fields. Here, I can
>> model n:n mappings from CAS indexes to Lucene fields, e.g. mapping terms
>> originating from one CAS index to 10 Lucene fields, or the other way round.
>> Again, if this is easily possible with another existing, maintained
>> component, please point me to it.
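
As a rough picture of the n:n mapping quoted above, here is a minimal Python sketch. It does not reproduce the LuCas configuration format. The cover is assumed to be a plain dictionary from a source (standing in for a CAS index) to the fields that should receive its terms, and all names (distribute_terms, facet_genes, facet_all and so on) are hypothetical.

```python
from collections import defaultdict


def distribute_terms(terms_by_source, cover):
    """Copy the terms of each source into every field the cover assigns to it."""
    fields = defaultdict(list)
    for source, terms in terms_by_source.items():
        for field in cover.get(source, []):
            fields[field].extend(terms)
    return dict(fields)


# Overlapping cover: both sources also feed the shared field "facet_all".
cover = {
    "genes": ["facet_genes", "facet_all"],
    "diseases": ["facet_diseases", "facet_all"],
}
terms_by_source = {"genes": ["brca1"], "diseases": ["cancer"]}

fields = distribute_terms(terms_by_source, cover)
print(fields["facet_all"])
# prints: ['brca1', 'cancer']
```

One source can feed many fields and one field can collect from many sources, which is the n:n relation described in the quoted text.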
>> 
>> In short: I, too, ultimately don't use Lucene but Solr/ES. However, LuCas
>> has some (Lucene) document fine-tuning capabilities I need/work
>> with.
>> This means: I don't necessarily need LuCas in a Lucene-updated version. I
>> use it more as a fine-tuned TokenStream-smith. I could require it to be
>> updated in the future when LuCas is not able to express a specific feature
>> of a newer Lucene version.
>> 
>> I hope this wall of text was understandable, thanks for reading through it
>> ;-)
>> 
>> Best,
>> 
>> Erik
>> 
>> 
>> 
>>> On 26 Aug 2014, at 09:43, <[email protected]> wrote:
>>> 
>>> Hi Erik and Jörn,
>>> 
>>> I've used Solr in the meantime. It is so easy to quickly write a CAS
>>> consumer that sends documents to a Solr web service. Writing to a Lucene
>>> index is minimally more work. Could this be the reason why nobody cares
>>> about the outdated version? Is there really a need for Lucas and Solrcas
>>> anymore? What do you think? It would be nice to have some opinions on
>>> this.
>>> 
>>> Of all people reading this list, who wants to have a Lucas or Solrcas for
>>> the current version of Lucene?
>>> 
>>> Cheers,
>>> Armin
>>> 
>>> -----Original Message-----
>>> From: Erik Fäßler [mailto:[email protected]]
>>> Sent: Friday, 22 August 2014 16:34
>>> To: [email protected]
>>> Subject: Re: AW: Lucas
>>> 
>>> I am using LuCas in production in the last SNAPSHOT version that can be
>>> found in the SVN but not in the maven repository. I was also not aware a
>>> patch would be required to get it to work, I am using it in its current
>>> SVN state, including the splitter filter.
>>> I would be willing to help with a migration and contribute to
>>> discussions/plans. However, I won't have time to do it all on my own,
>>> especially since I use it as a bridge to Solr/ElasticSearch that kind of
>>> remedies the version difference. Thus I use it with newer Solr/ES versions
>>> without problems so far.
>>> 
>>> I will be on vacation for two weeks, after that I'd be available for
>>> contributions.
>>> 
>>> Best,
>>> 
>>> Erik
>>> 
>>>> On 22 Aug 2014, at 15:36, Jörn Kottmann <[email protected]> wrote:
>>>> 
>>>> It would probably be nice to migrate those to the current versions of
>>>> Lucene/Solr.
>>>> 
>>>> Jörn
>>>> 
>>>>> On 08/13/2014 08:44 AM, [email protected] wrote:
>>>>> Hi Renaud,
>>>>> 
>>>>> that's nice, thank you. Are you using Lucene 4.x or an older version?
>>>>> 
>>>>> It's been a while since I asked that question, and I didn't get much
>>>>> response. Is the project dead? Is it just too easy to code a simple
>>>>> annotator for Lucene or Solr to justify the effort maintaining Lucas and
>>>>> Solrcas?
>>>>> 
>>>>> Cheers,
>>>>> Armin
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Renaud Richardet [mailto:[email protected]]
>>>>> Sent: Monday, 11 August 2014 23:12
>>>>> To: [email protected]
>>>>> Subject: Re: Lucas
>>>>> 
>>>>> Hi Armin,
>>>>> 
>>>>> I used it a while ago. I had to apply the following patch to make it
>>>>> work:
>>>>> https://gist.github.com/renaud/bc34a48ca22f787f6c11
>>>>> 
>>>>> HTH, Renaud
>>>>> 
>>>>> 
>>>>>> On Mon, Jul 28, 2014 at 2:55 PM, <[email protected]> wrote:
>>>>>> 
>>>>>> Hi!
>>>>>> 
>>>>>> Is someone using Lucas? It seems to be slightly outdated. It depends
>>>>>> on Lucene 2.9.3. Lucene is at version 4.9.0 right now. Is there an
>>>>>> alternative?
>>>>>> 
>>>>>> Regards,
>>>>>> Armin
>>>>> 
>>>>> --
>>>>> Renaud Richardet
>>>>> Blue Brain Project  PhD candidate
>>>>> EPFL  Station 15
>>>>> CH-1015 Lausanne
>>>>> phone: +41-78-675-9501
>>>>> http://people.epfl.ch/renaud.richardet
>>>> 
>>
