ReaderManager, more drama with things not being closed before closing the Directory

2016-10-19 Thread Trejkaz
Hi all.

I seem to have a situation where ReaderManager is reducing a refCount
to 0 before it actually releases all its references.

It's difficult to debug because it's all mixed up in our framework for
multiple ReaderManagers, which I'm still not convinced works, because the
concurrency is impossible to reason about. I probably won't be allowed to
publish the code either, so nobody at Lucene can look at it. (Which
is why I hope that someone at Lucene figures out how to manage more
than one index reliably one day...)

The stack trace where the directory gets closed is just an ordinary
reader refresh, but I guess this reader was the last one using the
Directory, so now we're closing that as well:

java.lang.RuntimeException: Resources inside the directory did not get closed before closing the directory
    at com.acme.storage.textindex.store.CloseCheckingDirectory.close(CloseCheckingDirectory.java:109)
    at com.acme.storage.textindex.index.DefaultIndexReaderSharer$IndexReaderWrapper.release(DefaultIndexReaderSharer.java:146)
    at com.acme.storage.textindex.index.DefaultIndexReaderSharer$IndexReaderWrapper.access$100(DefaultIndexReaderSharer.java:77)
    at com.acme.storage.textindex.index.DefaultIndexReaderSharer.release(DefaultIndexReaderSharer.java:45)
    at com.acme.storage.textindex.DefaultTextIndex$WrappingReaderManager$1.doClose(DefaultTextIndex.java:370)
    at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:253)
    at com.acme.storage.textindex.DefaultTextIndex$WrappingReaderManager.decRef(DefaultTextIndex.java:331)
    at com.acme.storage.textindex.DefaultTextIndex$WrappingReaderManager.decRef(DefaultTextIndex.java:306)
    at org.apache.lucene.search.ReferenceManager.release(ReferenceManager.java:274)
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:189)
    at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)

The stack trace which opened the resource and didn't close it is
apparently from the first reader opened by the ReaderManager:

Caused by: java.lang.RuntimeException: unclosed IndexInput: _7d.tvd
    at com.acme.storage.textindex.store.CloseCheckingDirectory.addOpenResource(CloseCheckingDirectory.java:82)
    at com.acme.storage.textindex.store.CloseCheckingDirectory.openInput(CloseCheckingDirectory.java:57)
    at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.<init>(CompressingTermVectorsReader.java:144)
    at org.apache.lucene.codecs.compressing.CompressingTermVectorsFormat.vectorsReader(CompressingTermVectorsFormat.java:91)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:65)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:58)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:50)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:50)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
    at com.acme.storage.textindex.index.DefaultIndexReaderSharer$CustomReaderManager.<init>(DefaultIndexReaderSharer.java:164)

But if it's the first reader held by the ReaderManager, I wouldn't
expect the refCount to be 0, so it shouldn't be closing the directory.

I can't reproduce this myself, so I can't just dump out conveniently
placed messages to figure out how it's happening...

But has anyone else seen something like this?

CustomReaderManager is probably shareable, it just does this:

private static class CustomReaderManager extends ReferenceManager<DirectoryReader> {
    private CustomReaderManager(Directory directory) throws IOException {
        current = UnInvertingDirectoryReader.wrap(DirectoryReader.open(directory));
    }

    @Override
    protected void decRef(DirectoryReader reference) throws IOException {
        reference.decRef();
    }

    @Override
    protected DirectoryReader refreshIfNeeded(DirectoryReader referenceToRefresh) throws IOException {
        return DirectoryReader.openIfChanged(referenceToRefresh);
    }

    @Override
    protected boolean tryIncRef(DirectoryReader reference) {
        return reference.tryIncRef();
    }

    @Override
    protected int getRefCount(DirectoryReader reference) {
        return reference.getRefCount();
    }
}

So it's basically the same as the normal ReaderManager, except that it
wraps the reader in an UnInvertingDirectoryReader. The only reason we're
forced to subclass the manager to do this is that if we don't, each call
wraps the reader in a new UnInvertingDirectoryReader instance, and
caching keyed on the reader instance stops working.
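For reference, the refCount contract that ReferenceManager relies on can be sketched without any Lucene code at all (this is an illustration, not the real implementation; all names here are mine): the count starts at 1 for the manager's own reference, every successful tryIncRef() must be paired with a decRef(), and the close happens exactly once, by the decRef() that takes the count to 0. A count of 0 while references are still outstanding means some acquire/release pair in the wrapping layers is unbalanced.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only (not Lucene code) of the refCount contract behind
// ReferenceManager: the count starts at 1, tryIncRef() fails once the
// count has reached 0, and the underlying resource is closed exactly
// once, by the decRef() that takes the count to 0.
class RefCountedResource {
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean closed = false;

    boolean tryIncRef() {
        while (true) {
            int count = refCount.get();
            if (count <= 0) {
                return false; // already fully released; too late to acquire
            }
            if (refCount.compareAndSet(count, count + 1)) {
                return true;
            }
        }
    }

    void decRef() {
        int count = refCount.decrementAndGet();
        if (count == 0) {
            closed = true; // last reference gone: close underlying resources here
        } else if (count < 0) {
            throw new IllegalStateException("decRef() called more often than incRef()");
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```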

DefaultIndexReaderSharer#release() is 

ApacheCon is now less than a month away!

2016-10-19 Thread Rich Bowen
Dear Apache Enthusiast,

ApacheCon Sevilla is now less than a month out, and we need your help
getting the word out. Please tell your colleagues, your friends, and
members of related technical communities, about this event. Rates go up
November 3rd, so register today!

ApacheCon and Apache Big Data are the official gatherings of the
Apache Software Foundation, and among the best places in the world to
meet other members of your project community, gain deeper knowledge
about your favorite Apache projects, and learn more about the ASF. Your
project doesn't live in a vacuum - it's part of a larger family of
projects that have a shared set of values, as well as a shared
governance model. And many of our projects overlap in developers, in
communities, and in subject matter, making ApacheCon a great place for
cross-pollination of ideas and of communities.

Some highlights of these events will be:

* Many of our board members and project chairs will be present
* The lightning talks are a great place to hear, and give, short
presentations about what you and other members of the community are
working on
* The key signing gets you linked into the web of trust, leaving you
better able to verify our software releases
* Evening receptions and parties where you can meet community
members in a less formal setting
* The State of the Feather, where you can learn what the ASF has
done in the last year, and what's coming next year
* BarCampApache, an informal unconference-style event, is another
venue for discussing your projects at the ASF

We have a great schedule lined up, covering the wide range of ASF
projects, including:

* CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos -
Carlos Sanchez
* Inner sourcing 101 - Jim Jagielski
* Java Memory Leaks in Modular Environments - Mark Thomas

ApacheCon/Apache Big Data will be held in Sevilla, Spain, at the Melia
Sevilla, November 14th through 18th. You can find out more at
http://apachecon.com/. Other ways to stay up to date with ApacheCon are:

* Follow us on Twitter at @apachecon
* Join us on IRC, at #apachecon on the Freenode IRC network
* Join the apachecon-discuss mailing list by sending email to
apachecon-discuss-subscr...@apache.org
* Or contact me directly at rbo...@apache.org with questions,
comments, or to volunteer to help

See you in Sevilla!

-- 
Rich Bowen: VP, Conferences
rbo...@apache.org
http://apachecon.com/
@apachecon

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-19 Thread Adrien Grand
You would need to override the wrapComponents method in order to wrap the
token stream. See for instance Lucene's LimitTokenCountAnalyzer.
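A minimal sketch of that suggestion, assuming Lucene 6.x APIs (the class name FoldingClassicAnalyzer is mine, not part of Lucene):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;

// Hypothetical wrapper that appends ASCIIFoldingFilter to the end of
// ClassicAnalyzer's analysis chain, following the LimitTokenCountAnalyzer
// pattern Adrien mentions.
public final class FoldingClassicAnalyzer extends AnalyzerWrapper {
    private final Analyzer delegate = new ClassicAnalyzer();

    public FoldingClassicAnalyzer() {
        // GLOBAL_REUSE_STRATEGY because we return modified components below.
        super(Analyzer.GLOBAL_REUSE_STRATEGY);
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
        return delegate;
    }

    @Override
    protected TokenStreamComponents wrapComponents(String fieldName,
                                                   TokenStreamComponents components) {
        // Keep the original tokenizer; append the folding filter to the chain.
        return new TokenStreamComponents(components.getTokenizer(),
                new ASCIIFoldingFilter(components.getTokenStream()));
    }
}
```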

On Tue, Oct 18, 2016 at 18:46, Kumaran Ramasubramanian wrote:

> Hi Adrien
>
> How to do this? Any Pointers?
>
>
> > If it is fine to add the ascii folding filter at the end of the analysis
> > chain, then you could use AnalyzerWrapper.
>
> -
> Kumaran R
>
> On Tue, Oct 11, 2016 at 9:59 PM, Kumaran Ramasubramanian
> <kums@gmail.com> wrote:
>
> >
> >
> > @Ahmet, Uwe: Thanks a lot for your suggestions. I have already written a
> > custom analyzer as you said, but I am just trying to avoid adding a new
> > component to my search flow.
> >
> > @Adrien: how do I add the filter using AnalyzerWrapper? Any pointers?
> >
> > On Tue, Oct 11, 2016 at 8:16 PM, Uwe Schindler  wrote:
> >
> >> I'd suggest using CustomAnalyzer to define your own analyzer. It lets
> >> you build an analyzer from the components (tokenizers and filters) you
> >> want.
> >>
> >> Uwe
> >>
> >> -
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >> > -Original Message-
> >> > From: Adrien Grand [mailto:jpou...@gmail.com]
> >> > Sent: Tuesday, October 11, 2016 4:37 PM
> >> > To: java-user@lucene.apache.org
> >> > Subject: Re: How to add ASCIIFoldingFilter in ClassicAnalyzer
> >> >
> >> > Hi Kumaran,
> >> >
> >> > If it is fine to add the ascii folding filter at the end of the
> >> > analysis chain, then you could use AnalyzerWrapper. Otherwise, you
> >> > need to create a new analyzer that has the same analysis chain as
> >> > ClassicAnalyzer, plus an ASCIIFoldingFilter.
> >> >
> >> > On Tue, Oct 11, 2016 at 16:22, Kumaran Ramasubramanian wrote:
> >> >
> >> > > Hi All,
> >> > >
> >> > > Is there any way to add ASCIIFoldingFilter on top of ClassicAnalyzer
> >> > > without writing a new custom analyzer? Should I extend
> >> > > StopwordAnalyzerBase again?
> >> > >
> >> > > I know that ClassicAnalyzer is final. Is there a special reason for
> >> > > making it final? StandardAnalyzer was not final before.
> >> > >
> >> > > public final class ClassicAnalyzer extends StopwordAnalyzerBase
> >> > >
> >> > >
> >> > > --
> >> > > Kumaran R
> >> > >
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >
>
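Following up on Uwe's pointer, a sketch of what the CustomAnalyzer equivalent might look like (Lucene 6.x; the SPI factory names "classic", "lowercase", "stop" and "asciifolding" are my assumption of the registered names - check them against your version):

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// Sketch: rebuild a ClassicAnalyzer-like chain explicitly and append
// asciifolding at the end. Factory names are the SPI names registered in
// lucene-analyzers-common (assumed here; verify for your version).
public class BuildFoldingAnalyzer {
    public static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer("classic")       // ClassicTokenizer
                .addTokenFilter("classic")      // ClassicFilter (acronyms, possessives)
                .addTokenFilter("lowercase")
                .addTokenFilter("stop")         // default English stop words
                .addTokenFilter("asciifolding") // fold diacritics to ASCII
                .build();
    }
}
```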


Re: POS tagging in Lucene

2016-10-19 Thread Tommaso Teofili
I think it might be helpful to handle POS tags as TypeAttributes, so that
the input and output texts would be cleaner and you could still filter and
retrieve tokens by type (e.g. with TypeTokenFilter).

My 2 cents,
Tommaso
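One possible shape for Tommaso's suggestion, as a sketch (Lucene 6.x; the filter name and the "[TAG]" suffix convention are mine, for illustration):

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical filter implementing Tommaso's idea: strip a trailing
// "[TAG]" from each token and store the tag in the TypeAttribute, so the
// term text stays clean and downstream filters can select tokens by POS.
public final class PosSuffixToTypeFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    public PosSuffixToTypeFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        int open = term.lastIndexOf('[');
        if (open >= 0 && term.endsWith("]")) {
            typeAtt.setType(term.substring(open + 1, term.length() - 1)); // e.g. "NN"
            termAtt.setLength(open); // truncate the "[NN]" suffix off the term
        }
        return true;
    }
}
```

Downstream, TypeTokenFilter (which drops tokens whose type is in the given set, or keeps only those types with its whitelist variant) can then filter by tag.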


On Wed, Oct 19, 2016 at 11:56, Niki Pavlopoulou wrote:

> Hi Steve,
>
> thank you for your answer. I created a custom Lucene Analyser in the end.
> Just to clarify what I mean: Lucene works perfectly for pure words, but
> since it does not support POS tagging, some workaround is needed for the
> analysis of tokens with POS tags. For example:
>
> Input without POS tags: "I love Lucene's library. It is perfect."
> Output: List(love, lucene, library, perfect)
>
> Input with POS tags: "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP]
> is[VBZ] perfect[JJ]"
> Output: List(i[prp], love[vbp], lucene's[nnp], library[nn], it[prp],
> is[vbz], perfect[jj])
> *Desired output*: List(love[vbp], lucene[nnp], library[nn], perfect[jj])
>
> If one does the POS tagging after the analysis, then the tags might be
> wrong as the right syntax has been lost. This is why the POS tagging needs
> to happen early on and then the analysis to take place.
>
> Regards,
> Niki.
>
> On 18 October 2016 at 19:59, Steve Rowe  wrote:
>
> > Hi Niki,
> >
> > > On Oct 18, 2016, at 7:27 AM, Niki Pavlopoulou  wrote:
> > >
> > > Hi all,
> > >
> > > I am using Lucene and OpenNLP for POS tagging. I would like to support
> > > biGrams with POS tags as well. For example, I would like something like
> > > that:
> > >
> > > Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
> > > Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])
> > >
> > > The problem above is that I do not have "pure" tokens, like "I", "am"
> > etc.,
> > > so the analysis could be wrong if I add the POS tags as an input in
> > Lucene.
> > > Is there a way to solve this, apart from creating my custom Lucene
> > > analyser?
> >
> > To create your bigrams, check out ShingleFilter: <
> > http://lucene.apache.org/core/6_2_1/analyzers-common/org/
> > apache/lucene/analysis/shingle/ShingleFilter.html>
> >
> > I’m not sure what you mean by “the analysis could be wrong if I add the
> > POS tags as an input in Lucene” - can you give an example?
> >
> > You may be interested in the work-in-progress addition of OpenNLP
> > integration with Lucene here: <https://issues.apache.org/jira/browse/LUCENE-2899>
> >
> > --
> > Steve
> > www.lucidworks.com
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
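Steve's ShingleFilter pointer above, as a runnable sketch (Lucene 6.x; whitespace tokenization and the class name are just for the demo):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Demo of producing token bigrams with ShingleFilter, keeping the POS-tag
// suffixes intact because we only split on whitespace.
public class BigramDemo {
    public static List<String> bigrams(String text) throws IOException {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader(text));
        // min and max shingle size 2 => bigrams; drop the single tokens.
        ShingleFilter shingles = new ShingleFilter(tok, 2, 2);
        shingles.setOutputUnigrams(false);
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        List<String> out = new ArrayList<>();
        shingles.reset();
        while (shingles.incrementToken()) {
            out.add(term.toString());
        }
        shingles.end();
        shingles.close();
        return out;
    }
}
```

For the input "I[PRP] am[VBP] using[VBG] Lucene[NNP]" this should yield the three space-joined bigrams from Niki's example.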


Re: POS tagging in Lucene

2016-10-19 Thread Niki Pavlopoulou
Hi Steve,

thank you for your answer. I created a custom Lucene Analyser in the end.
Just to clarify what I mean: Lucene works perfectly for pure words, but
since it does not support POS tagging, some workaround is needed for the
analysis of tokens with POS tags. For example:

Input without POS tags: "I love Lucene's library. It is perfect."
Output: List(love, lucene, library, perfect)

Input with POS tags: "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP]
is[VBZ] perfect[JJ]"
Output: List(i[prp], love[vbp], lucene's[nnp], library[nn], it[prp],
is[vbz], perfect[jj])
*Desired output*: List(love[vbp], lucene[nnp], library[nn], perfect[jj])

If one does the POS tagging after the analysis, then the tags might be
wrong as the right syntax has been lost. This is why the POS tagging needs
to happen early on and then the analysis to take place.

Regards,
Niki.

On 18 October 2016 at 19:59, Steve Rowe  wrote:

> Hi Niki,
>
> > On Oct 18, 2016, at 7:27 AM, Niki Pavlopoulou  wrote:
> >
> > Hi all,
> >
> > I am using Lucene and OpenNLP for POS tagging. I would like to support
> > biGrams with POS tags as well. For example, I would like something like
> > that:
> >
> > Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
> > Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])
> >
> > The problem above is that I do not have "pure" tokens, like "I", "am"
> etc.,
> > so the analysis could be wrong if I add the POS tags as an input in
> Lucene.
> > Is there a way to solve this, apart from creating my custom Lucene
> > analyser?
>
> To create your bigrams, check out ShingleFilter: <
> http://lucene.apache.org/core/6_2_1/analyzers-common/org/
> apache/lucene/analysis/shingle/ShingleFilter.html>
>
> I’m not sure what you mean by “the analysis could be wrong if I add the
> POS tags as an input in Lucene” - can you give an example?
>
> You may be interested in the work-in-progress addition of OpenNLP
> integration with Lucene here: <https://issues.apache.org/jira/browse/LUCENE-2899>
>
> --
> Steve
> www.lucidworks.com
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>