[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545918#comment-15545918 ]

Tristan Nixon edited comment on OPENNLP-776 at 10/4/16 4:44 PM:

Sorry, I probably should have removed that older patch and consolidated them into a single patch.

was (Author: tnixon):
Sorry, I probably should have removed that older patch.

> Model Objects should be Serializable
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: tools-1.5.3
> Reporter: Tristan Nixon
> Assignee: Joern Kottmann
> Priority: Minor
> Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch,
> serialization_proxy.patch
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can
> enable a number of features offered by other Java frameworks (my own use case
> is described below). You've already got a good mechanism for
> (de-)serialization, but it cannot be leveraged by other frameworks without
> implementing the Serializable interface. I'm attaching a patch to BaseModel
> that implements the methods in the java.io.Externalizable interface as
> wrappers to the existing (de-)serialization methods. This simple change can
> open up a number of useful opportunities for integrating OpenNLP with other
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This
> requires that components of the system be distributed between the driver and
> worker nodes within the cluster. In order to do this, Spark uses Java
> serialization API to transmit objects between nodes. This is far more
> efficient than instantiating models on each node independently.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545918#comment-15545918 ]

Tristan Nixon commented on OPENNLP-776:

Sorry, I probably should have removed that older patch.
[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545887#comment-15545887 ]

Joern Kottmann edited comment on OPENNLP-776 at 10/4/16 4:30 PM:

Sorry for the confusion, I was speaking all the time about serializable-basemodel.patch and not serialization_proxy.patch.

was (Author: joern):
Sorry for the confusion is was speaking all the time about serializable-basemodel.patch and not serialization_proxy.patch.
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545887#comment-15545887 ]

Joern Kottmann commented on OPENNLP-776:

Sorry for the confusion, I was speaking all the time about serializable-basemodel.patch and not serialization_proxy.patch.
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545791#comment-15545791 ]

Tristan Nixon commented on OPENNLP-776:

Well, it's a bit of a messy type hierarchy, since the write(int) method is defined on both the abstract class OutputStream AND on the interface DataOutput, which is inherited by interface ObjectOutput. The ObjectOutputStream class inherits from BOTH OutputStream AND ObjectOutput. However, the Externalizable interface defines the method writeExternal(ObjectOutput), which implies that there could be other implementations of this interface that are not necessarily subtypes of OutputStream. This is in fact what some other frameworks do - they provide an alternative implementation.
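The point about writeExternal(ObjectOutput) can be illustrated with a minimal sketch. Everything named here (ObjectOutputAdapter, ExternalizableSketch, serialize, nonStreamOutput) is hypothetical and not taken from the attached patches; serialize merely stands in for a model's existing stream-based serialization method. The sketch builds an ObjectOutput that is not an OutputStream (via java.lang.reflect.Proxy, only to keep the example short) and shows that an instanceof check then falls through to an adapter:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.lang.reflect.Proxy;

/** Hypothetical adapter: exposes any ObjectOutput as an OutputStream. */
class ObjectOutputAdapter extends OutputStream {
    private final ObjectOutput out;

    ObjectOutputAdapter(ObjectOutput out) { this.out = out; }

    @Override public void write(int b) throws IOException { out.write(b); }
    @Override public void write(byte[] b, int off, int len) throws IOException { out.write(b, off, len); }
    @Override public void flush() throws IOException { out.flush(); }
}

class ExternalizableSketch {
    /** Stand-in for an existing stream-based serialization method (assumption). */
    static void serialize(OutputStream stream) throws IOException {
        stream.write(new byte[] {1, 2, 3});
        stream.flush();
    }

    /** What a writeExternal wrapper might do; returns the branch that was taken. */
    static String writeExternal(ObjectOutput out) throws IOException {
        if (out instanceof OutputStream) {
            serialize((OutputStream) out);
            return "direct";
        }
        serialize(new ObjectOutputAdapter(out));
        return "adapter";
    }

    /** Builds an ObjectOutput that is NOT an OutputStream; bytes end up in sink. */
    static ObjectOutput nonStreamOutput(ByteArrayOutputStream sink) {
        return (ObjectOutput) Proxy.newProxyInstance(
            ExternalizableSketch.class.getClassLoader(),
            new Class<?>[] {ObjectOutput.class},
            (proxy, method, args) -> {
                // Only the write overloads used in this sketch are handled.
                if (method.getName().equals("write")) {
                    if (args.length == 3) {
                        sink.write((byte[]) args[0], (Integer) args[1], (Integer) args[2]);
                    } else if (args[0] instanceof Integer) {
                        sink.write((Integer) args[0]);
                    } else {
                        byte[] bytes = (byte[]) args[0];
                        sink.write(bytes, 0, bytes.length);
                    }
                }
                return null; // the ObjectOutput methods called here all return void
            });
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println(writeExternal(new ObjectOutputStream(new ByteArrayOutputStream())));
        System.out.println(writeExternal(nonStreamOutput(sink)));
        System.out.println(sink.size());
    }
}
```

With a plain ObjectOutputStream the direct branch is taken; with the proxy-backed ObjectOutput the adapter branch is taken and the three bytes still reach the sink.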
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545655#comment-15545655 ]

Joern Kottmann commented on OPENNLP-776:

Hmm, then I don't understand it. The write method can only be called with an object of type java.io.ObjectOutputStream and that must extend OutputStream, so it should be safe to assume that? No? ObjectOutputStream is a class and not an interface. It is possible to pass in an object of it, or define a new class which extends it; in both cases the object also has the type OutputStream, right?
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545560#comment-15545560 ]

Joern Kottmann commented on OPENNLP-776:

Can you give me an example? OpenNLP today only runs on Java 7 and is not tested on any other JVMs, so you will probably run into issues here and there. Do you run it on Android? I think it is safe to simply hand over the stream and assume the type is InputStream / OutputStream.
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545212#comment-15545212 ]

Joern Kottmann commented on OPENNLP-776:

Thanks, looks good, I think we can more or less merge it like that for the 1.6.1 release. One question: in which case can the else block of the if (in instanceof InputStream) check be entered in the read and write methods? As far as I understand, this will always be true, since the type is defined as part of the Java API and won't change. I suggest we drop the else block. I will test this on my cluster in the next few days and then report back here.
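For readers without the patch at hand, the check being discussed might look roughly like the following. This is a sketch under assumptions, not the patch itself: ToyModel, its length-prefixed byte payload, the deserialize stand-in, and the readPath field (which only records the branch taken) are all hypothetical. It round-trips an Externalizable object through standard Java serialization, the mechanism the issue description says Spark uses:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

/** Hypothetical stand-in for a model class; not OpenNLP's BaseModel. */
class ToyModel implements Externalizable {
    byte[] payload = new byte[0];
    String readPath = "none"; // records which branch readExternal took

    public ToyModel() {} // Externalizable requires a public no-arg constructor

    ToyModel(byte[] payload) { this.payload = payload; }

    /** Stand-in for an existing InputStream-based deserialization method. */
    void deserialize(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        payload = new byte[data.readInt()];
        data.readFully(payload);
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(payload.length); // length prefix, then the raw bytes
        out.write(payload);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        if (in instanceof InputStream) {
            readPath = "stream";
            deserialize((InputStream) in);
        } else {
            // Only reachable when a caller supplies an ObjectInput that is not a stream.
            readPath = "fallback";
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }
    }

    /** Serializes and deserializes the model with the standard object streams. */
    static ToyModel roundTrip(ToyModel model) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ToyModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel copy = roundTrip(new ToyModel(new byte[] {4, 5, 6}));
        System.out.println(copy.readPath + " " + copy.payload.length);
    }
}
```

Through ObjectInputStream the stream branch is always taken, because ObjectInputStream extends InputStream. The fallback branch matters only for callers that pass some other ObjectInput implementation.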
[jira] [Comment Edited] (OPENNLP-862) BRAT format packages do not handle punctuation correctly when training NER model
[ https://issues.apache.org/jira/browse/OPENNLP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544995#comment-15544995 ]

Joern Kottmann edited comment on OPENNLP-862 at 10/4/16 10:22 AM:

OpenNLP has to tokenize its input text. Brat can avoid this by just letting the user decide how he wants to mark things. In the end you will need a tokenizer: the WhitespaceTokenizer has the issue you mentioned, while the SimpleTokenizer splits on character class and will probably work better for you. Anyway, I think it makes sense to add an option to let the Brat parser assume that annotation boundaries are also always token boundaries. It would be very nice if you could send us a patch to add this option.

was (Author: joern):
OpenNLP has to tokenize its input text. Brat can avoid this by just letting the user decide how he wants to mark things. In the end you will need a tokenizer, the Whitespace Tokenizer has the isue you mentioned, the SimpleTokenizer splits on character class and will probably work better for you. Anyway, I think it makes sense to add an option to let the Brat parser assume that annotation boundaries are also always token boundaries. It would be very nice if you could send us a patch to add this option.

> BRAT format packages do not handle punctuation correctly when training NER
> model
>
> Key: OPENNLP-862
> URL: https://issues.apache.org/jira/browse/OPENNLP-862
> Project: OpenNLP
> Issue Type: Bug
> Components: Formats
> Affects Versions: 1.6.0
> Reporter: Gregory Werner
>
> BRAT does not require preprocessing of text files in order to add annotations
> to text documents. And this is great because I can feed documents from
> corpora I am given directly into BRAT. If I have a line such as:
> Residence: Athens, Georgia
> I would provide 2 annotations in BRAT, Athens and Georgia, and BRAT would
> generate the offset and everything would be fine.
> It appears though that I only get 1 entity correctly processed (and the other
> dropped) in OpenNLP with TokenNameFinderTrainer.brat, Georgia, because the
> comma is not separated from Athens. I have 789 annotated raw, non
> pre-processed text documents from past efforts. I believe that OpenNLP should
> be able to handle lines like the above in the case of the BRAT format code.
> It appears that BratNameSampleStream uses the WhitespaceTokenizer and that is
> what creates Athens, as a token. I find that the SimpleTokenizer might
> perform better with BRAT through my limited testing of raw documents if the
> current general approach is held.
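The mismatch described in this report can be reproduced without OpenNLP. The sketch below is a simplified stand-in, not the code of WhitespaceTokenizer, SimpleTokenizer, or BratNameSampleStream: TokenBoundarySketch and its methods are hypothetical, the character-class split is an approximation, and annotations are assumed to cover exactly one token. It checks whether the character offsets of the two annotations line up with token boundaries:

```java
import java.util.ArrayList;
import java.util.List;

class TokenBoundarySketch {
    /** Coarse character classes: 0 whitespace, 1 letter, 2 digit, 3 other. */
    static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    /** Returns [start, end) offsets of tokens split on whitespace only. */
    static List<int[]> whitespaceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) start = i;
            if (boundary && start >= 0) {
                spans.add(new int[] {start, i});
                start = -1;
            }
        }
        return spans;
    }

    /** Returns [start, end) offsets of tokens split wherever the character class changes. */
    static List<int[]> charClassSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                || charClass(text.charAt(i)) != charClass(text.charAt(start));
            if (boundary) {
                if (charClass(text.charAt(start)) != 0) spans.add(new int[] {start, i});
                start = i;
            }
        }
        return spans;
    }

    /** True if some token covers exactly the annotation span [begin, end). */
    static boolean matchesToken(List<int[]> spans, int begin, int end) {
        for (int[] span : spans) {
            if (span[0] == begin && span[1] == end) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Residence: Athens, Georgia";
        // Annotation offsets: "Athens" is [11, 17), "Georgia" is [19, 26).
        System.out.println(matchesToken(whitespaceSpans(text), 11, 17)); // false: token is "Athens,"
        System.out.println(matchesToken(whitespaceSpans(text), 19, 26)); // true
        System.out.println(matchesToken(charClassSpans(text), 11, 17));  // true
        System.out.println(matchesToken(charClassSpans(text), 19, 26));  // true
    }
}
```

With whitespace splitting only Georgia aligns with a token, which matches the reporter's observation that one entity is dropped; with a character-class split both annotations align.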