[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545918#comment-15545918
 ] 

Tristan Nixon edited comment on OPENNLP-776 at 10/4/16 4:44 PM:


Sorry, I probably should have removed that older patch and consolidated them 
into a single patch.


was (Author: tnixon):
Sorry, I probably should have removed that older patch.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: tools-1.5.3
> Reporter: Tristan Nixon
> Assignee: Joern Kottmann
> Priority: Minor
> Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.
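
A minimal Java sketch of the idea in the issue description above: implementing
java.io.Externalizable as a thin wrapper around existing stream-based
(de-)serialization methods. ToyModel, serialize, load and ExternalizableSketch
are hypothetical stand-ins, not OpenNLP's BaseModel API, and buffering through
a byte array is an assumption made here so the example does not depend on the
contents of the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

// Hypothetical stand-in for a model class; NOT OpenNLP's BaseModel.
class ToyModel implements Externalizable {
    private String payload = "";

    // Externalizable requires a public no-arg constructor.
    public ToyModel() {}

    ToyModel(String payload) { this.payload = payload; }

    String payload() { return payload; }

    // Stand-in for an existing stream-based writer.
    void serialize(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeUTF(payload);
        data.flush();
    }

    // Stand-in for an existing stream-based reader.
    void load(InputStream in) throws IOException {
        payload = new DataInputStream(in).readUTF();
    }

    // Wrapper: serialize into a buffer, then write a length-prefixed blob.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(buffer);
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Wrapper: read the length-prefixed blob, then reuse the stream reader.
    @Override
    public void readExternal(ObjectInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        load(new ByteArrayInputStream(bytes));
    }
}

class ExternalizableSketch {
    // Round-trips a model through standard Java serialization.
    static ToyModel roundTrip(ToyModel model) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (ToyModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(new ToyModel("model-bytes")).payload());
    }
}
```

Because the wrapper only uses methods declared on ObjectOutput and ObjectInput,
it should work with any framework that supplies its own implementations of
those interfaces.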



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545918#comment-15545918
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Sorry, I probably should have removed that older patch.



[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545887#comment-15545887
 ] 

Joern Kottmann edited comment on OPENNLP-776 at 10/4/16 4:30 PM:
-

Sorry for the confusion, I was talking about serializable-basemodel.patch the 
whole time and not about serialization_proxy.patch.


was (Author: joern):
Sorry for the confusion is was speaking all the time about 
serializable-basemodel.patch and not serialization_proxy.patch.



[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545887#comment-15545887
 ] 

Joern Kottmann commented on OPENNLP-776:


Sorry for the confusion, I was speaking all the time about 
serializable-basemodel.patch and not serialization_proxy.patch.



[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545791#comment-15545791
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Well, it's a bit of a messy type hierarchy, since the write(int) method is 
defined on both the abstract class OutputStream AND on the interface 
DataOutput, which is inherited by the interface ObjectOutput. The 
ObjectOutputStream class extends OutputStream AND implements ObjectOutput. 
However, the Externalizable interface defines the method 
writeExternal(ObjectOutput), which implies that there could be other 
implementations of ObjectOutput that are not necessarily subtypes of 
OutputStream. This is in fact what some other frameworks do: they provide an 
alternative implementation.
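
To make the point about the type hierarchy concrete, here is a hedged Java
sketch of how a writeExternal implementation could hand an ObjectOutput to code
that expects an OutputStream. ObjectOutputAdapter and asOutputStream are
hypothetical names used for illustration; this is not the code in any of the
attached patches.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

// Hypothetical helper, for illustration only.
class ObjectOutputAdapter {

    // Returns an OutputStream view of any ObjectOutput.
    static OutputStream asOutputStream(final ObjectOutput out) {
        if (out instanceof OutputStream) {
            // True for java.io.ObjectOutputStream and its subclasses.
            return (OutputStream) out;
        }
        // Fallback for implementations that only implement the interface.
        return new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                out.write(b);
            }

            @Override
            public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
            }

            @Override
            public void flush() throws IOException {
                out.flush();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream());
        // The standard JDK stream is passed through unchanged.
        System.out.println(asOutputStream(oos) == oos);
    }
}
```

The fallback branch is only reached when the caller supplies an ObjectOutput
that does not extend OutputStream. The adapter deliberately does not forward
close(), on the assumption that the caller owns the underlying stream.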



[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545655#comment-15545655
 ] 

Joern Kottmann commented on OPENNLP-776:


Hmm, then I don't understand it. The write method can only be called with an 
object of type java.io.ObjectOutputStream, and that must extend OutputStream, 
so it should be safe to assume that? No?

ObjectOutputStream is a class and not an interface. It is possible to pass in 
an object of it, or to define a new class which extends it. In both cases the 
object also has the type OutputStream, right?




[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545560#comment-15545560
 ] 

Joern Kottmann commented on OPENNLP-776:


Can you give me an example? OpenNLP today only runs on Java 7 and is not tested 
on any other JVMs, so you will probably run into issues here and there.
Do you run it on Android? I think it is safe to simply hand over the stream and 
assume the type is InputStream / OutputStream.



[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545212#comment-15545212
 ] 

Joern Kottmann commented on OPENNLP-776:


Thanks, looks good. I think we can more or less merge it like that for the 
1.6.1 release. One question: in which case can the else block of the 
if (in instanceof InputStream) be entered in the read and write methods? As far 
as I understand, this will always be true, since the type is defined as part of 
the Java API and won't change. I suggest we drop the else block.
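
For reference, a hedged Java sketch of the kind of check being discussed on the
read side. ObjectInputAdapter and asInputStream are hypothetical names, and the
else branch shown here is one possible fallback, not necessarily what the patch
does. The parameter type of readExternal is the interface ObjectInput, so the
else branch is reachable whenever a caller passes an implementation that does
not extend InputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;

// Hypothetical helper, for illustration only.
class ObjectInputAdapter {

    // Returns an InputStream view of any ObjectInput.
    static InputStream asInputStream(final ObjectInput in) {
        if (in instanceof InputStream) {
            // True for java.io.ObjectInputStream and its subclasses.
            return (InputStream) in;
        } else {
            // Reached only for ObjectInput implementations that are not streams.
            return new InputStream() {
                @Override
                public int read() throws IOException {
                    return in.read();
                }

                @Override
                public int read(byte[] b, int off, int len) throws IOException {
                    return in.read(b, off, len);
                }

                @Override
                public int available() throws IOException {
                    return in.available();
                }
            };
        }
    }
}
```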

I will test this on my cluster in the next days and then report back here.




[jira] [Comment Edited] (OPENNLP-862) BRAT format packages do not handle punctuation correctly when training NER model

2016-10-04 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544995#comment-15544995
 ] 

Joern Kottmann edited comment on OPENNLP-862 at 10/4/16 10:22 AM:
--

OpenNLP has to tokenize its input text. Brat can avoid this by just letting the 
user decide how they want to mark things. In the end you will need a tokenizer. 
The WhitespaceTokenizer has the issue you mentioned; the SimpleTokenizer 
splits on character class and will probably work better for you.

Anyway, I think it makes sense to add an option to let the Brat parser assume 
that annotation boundaries are also always token boundaries. It would be very 
nice if you could send us a patch to add this option.
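
The difference between the two tokenization strategies can be sketched in plain
Java. This is only a rough imitation written for illustration: TokenizerSketch,
whitespaceTokens and charClassTokens are hypothetical names, and the code does
not call or reproduce OpenNLP's WhitespaceTokenizer or SimpleTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not OpenNLP code.
class TokenizerSketch {

    // Splits on runs of whitespace only, so punctuation stays attached.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else.
    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Starts a new token whenever the character class changes.
    static List<String> charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;   // start index of the current run
        int state = 0;   // class of the current run (0 = between tokens)
        for (int i = 0; i < text.length(); i++) {
            int cls = charClass(text.charAt(i));
            if (cls != state) {
                if (state != 0) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
                state = cls;
            }
        }
        if (state != 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Residence:   Athens, Georgia";
        System.out.println(whitespaceTokens(line));  // [Residence:, Athens,, Georgia]
        System.out.println(charClassTokens(line));   // [Residence, :, Athens, ,, Georgia]
    }
}
```

With whitespace splitting, the comma stays attached to "Athens", so an
annotation covering only "Athens" does not line up with a token boundary. With
class-based splitting, the comma becomes its own token.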


was (Author: joern):
OpenNLP has to tokenize its input text. Brat can avoid this by just letting the 
user decide how he wants to mark things. In the end you will need a tokenizer, 
the Whitespace Tokenizer has the isue you mentioned, the SimpleTokenizer splits 
on character class and will probably work better for you.

Anyway, I think it makes sense to add an option to let the Brat parser assume 
that annotation boundaries are also always token boundaries. It would be very 
nice if you could send us a patch to add this option.

> BRAT format packages do not handle punctuation correctly when training NER 
> model
> 
>
> Key: OPENNLP-862
> URL: https://issues.apache.org/jira/browse/OPENNLP-862
> Project: OpenNLP
> Issue Type: Bug
> Components: Formats
> Affects Versions: 1.6.0
> Reporter: Gregory Werner
>
> BRAT does not require preprocessing of text files in order to add annotations 
> to text documents.  And this is great because I can feed documents from 
> corpora I am given directly into BRAT.  If I have a line such as:
> Residence:   Athens, Georgia
> I would provide 2 annotations in BRAT, Athens and Georgia, and BRAT would 
> generate the offset and everything would be fine.  
> It appears though that I only get 1 entity correctly processed (and the other 
> dropped) in OpenNLP with TokenNameFinderTrainer.brat, Georgia, because the 
> comma is not separated from Athens.  I have 789 annotated raw, non 
> pre-processed text documents from past efforts. I believe that OpenNLP should 
> be able to handle lines like the above in the case of the BRAT format code.
> It appears that BratNameSampleStream uses the WhitespaceTokenizer, and that is 
> what creates "Athens," (with the comma attached) as a token.  From my limited 
> testing on raw documents, the SimpleTokenizer might perform better with BRAT 
> if the current general approach is kept.


