Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-21 Thread Tanmai Khanna
Yup, if you see the transfer output as well, prepositions fail as transfer matches "pr" and not "pr.*". Hence, all FSTs will be ignoring secondary tags and there will be a separate matching mechanism for secondary tags. The problem with treating secondary tags like primary tags is that secondary

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-21 Thread Daniel Swanson
I think what's written in the proposal is to have pattern matching FSTs skip secondary tags (in this case a small modification to lrx-proc). It was suggested that matching secondary tags would end up as some sort of hash table lookup separate from the FSTs, but I think it could also work to just

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-21 Thread Jonathan Washington
The main thing I worry about here is lrx rules. Currently a lot of pairs have rules that match e.g. tags="adj", but not necessarily tags="adj.*". So something that's normally hargle might now be hargle, and that means the lrx rule won't match. Since we want this to be backwards-compatible
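
The backwards-compatibility concern in the message above can be illustrated with a minimal sketch. It assumes secondary tags are written as `feature:value` (the colon-based syntax and the `sf` feature name are assumptions for illustration) and uses a hypothetical `matches_exact` helper. This is not lrx-proc's actual matching code, only a model of an exact rule such as tags="adj".

```python
def primary_tags(tags):
    """Keep only tags without a feature:value shape (assumed secondary-tag syntax)."""
    return [t for t in tags if ":" not in t]


def matches_exact(rule_tags, unit_tags, skip_secondary):
    """Model an exact rule like tags="adj" (no trailing ".*").

    rule_tags: dot-separated tag string as written in a rule, e.g. "adj" or "n.pl"
    unit_tags: list of tags found on the lexical unit in the stream
    skip_secondary: if True, the matcher ignores secondary tags before comparing
    """
    candidate = primary_tags(unit_tags) if skip_secondary else list(unit_tags)
    return candidate == rule_tags.split(".")


old_unit = ["adj"]                # a unit as it appears today
new_unit = ["adj", "sf:hargle"]   # same unit carrying a hypothetical secondary tag

print(matches_exact("adj", old_unit, skip_secondary=False))
print(matches_exact("adj", new_unit, skip_secondary=False))
print(matches_exact("adj", new_unit, skip_secondary=True))
```

In this model the exact rule stops matching once a secondary tag is appended, unless the matcher skips secondary tags, which is the behaviour discussed elsewhere in the thread.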

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Tanmai Khanna
Hey Francis, I agree that it does seem like a solution searching for a problem if we look at it in isolation. But it's important to look at this in the context of eliminating trimming. Chronologically, this project was first, and still is, about eliminating dictionary trimming. Modification

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Tanmai Khanna
In a nutshell, by using the source analysis for disambiguation and transfer, we make the translation output better, and by outputting the source surface form instead of the source lemma, we make the output more comprehensible, or post-editable. Tanmai On Tue, Apr 21, 2020 at 12:19 AM Tanmai

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Francis Tyers
On 2020-04-20 19:21, Daniel Swanson wrote: Another way of putting this is that it looks like a technical solution in search of a problem, rather than a problem description in search of a solution. To me the most obvious thing to do with it is to put markup information in secondary tags as

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Daniel Swanson
> Another way of putting this is that it looks like a technical solution in search of a problem, rather than a problem description in search of a solution. To me the most obvious thing to do with it is to put markup information in secondary tags as a way of solving the superblank reordering

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Francis Tyers
On 2020-04-20 19:14, Francis Tyers wrote: On 2020-04-20 19:05, Tanmai Khanna wrote: Hey guys, When I proposed the modification to the Apertium stream format earlier, it was rightly pointed out to be a bit premature and not coupled with adequate justification. As part of preparation for my

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Francis Tyers
On 2020-04-20 19:05, Tanmai Khanna wrote: Hey guys, When I proposed the modification to the Apertium stream format earlier, it was rightly pointed out to be a bit premature and not coupled with adequate justification. As part of preparation for my project, I have tried to document the

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-04-20 Thread Tanmai Khanna
Hey guys, When I proposed the modification to the Apertium stream format earlier, it was rightly pointed out to be a bit premature and not coupled with adequate justification. As part of preparation for my project, I have tried to document the modification in a robust way, such that it makes it

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Just to clarify, in this original example: ^potato/patata$ case refers to capitalisation. Morphological case already has a tag, which would be primary information so this wouldn't touch that at all. So if it felt like we're changing the format, we're not and this would continue to be backwards

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Instead of looking at this as modifying or extending the apertium stream format, we could look at this as making tags more versatile by creating a new kind of tags which have a feature:value pair. That's all there is to it, really. In effect, it allows us to pass an arbitrary amount of info in the
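
As a rough illustration of the feature:value idea described above, here is a minimal sketch that separates ordinary (primary) tags from secondary feature:value tags in one lexical-unit reading. The `<feature:value>` spelling, the `sf` feature name, and the `split_tags` helper are assumptions made for this example. The thread does not show the final syntax, and this is not code from any Apertium tool.

```python
import re

# Matches one angle-bracketed tag, e.g. <n> or <sf:potatoes>
TAG_RE = re.compile(r"<([^<>]+)>")


def split_tags(reading):
    """Split the tags of a reading into primary tags and secondary pairs.

    A tag containing ":" is treated as a secondary feature:value pair
    (assumed syntax); every other tag is treated as primary.
    """
    primary = []
    secondary = {}
    for tag in TAG_RE.findall(reading):
        feature, sep, value = tag.partition(":")
        if sep:
            secondary[feature] = value
        else:
            primary.append(tag)
    return primary, secondary


primary, secondary = split_tags("potato<n><pl><sf:potatoes>")
print(primary)
print(secondary)
```

A program that does not care about secondary information could use only the first return value and behave as it does today, which is the backwards-compatibility property the thread keeps returning to.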

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Hi Mikel, > (0) No change should be made without proper regression testing. I think we > all agree on that! > Definitely, and this is something I'll add in the proposal. > (1) I still believe that the functionality should be proven without > rewriting the (critical) format parsing portions in

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Mikel L. Forcada
Folks: A quick round of comments after the responses by Tino, Xavi, Tanmai, and Fran. Did I miss anyone? (00) I cannot claim to have thoroughly considered all of the details of the proposal. Therefore, I can change my mind. (0) No change should be made without proper regression testing.

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Mikel, This is a preliminary idea and a suggestion that we discussed only yesterday, but I assure you that it will be justified in an uncontestable way before even one line of code is written. Ensuring backwards compatibility is of utmost importance, and because of this, in the proposal to modify

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tino Didriksen
It's all transparent. Nobody has to add secondary information to the stream. All current pipes will continue to work as-is, unmodified. All old data and files remain valid. The work is to allow for arbitrary secondary information to be added to the stream. Initially for use with surface forms, so

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Xavi Ivars
Message from Mikel L. Forcada on Sun, 29 March 2020 at 12:22: > Folks: > > The elders in Apertium will not be surprised if I voiced my opposition to > changing the format in the Apertium formats used between different modules > of the pipeline. In any case, this affects the core

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Francis Tyers
On 2020-03-29 11:21, Mikel L. Forcada wrote: Folks: The elders in Apertium will not be surprised if I voiced my opposition to changing the format in the Apertium formats used between different modules of the pipeline. In any case, this affects the core functionality of Apertium in many

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Mikel L. Forcada
Folks: The elders in Apertium will not be surprised if I voiced my opposition to changing the format in the Apertium formats used between different modules of the pipeline. In any case, this affects the core functionality of Apertium in many ways and its need should be justified in an

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tino Didriksen
No dixes will be harmed during this procedure. Nobody has to touch any existing language files for this work to be incredibly useful. The proposal is to allow the stream to carry secondary information. This secondary information can come from anywhere, and will mostly be dynamic. Initially, the

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
I apologise, it seems like the link got removed when the message was sent. Here it is: http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming Thanks Tanmai On Sun, Mar 29, 2020 at 3:11 PM Tanmai Khanna wrote: > Hey guys, > Here's a draft proposal for this project. Any comments

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Hey guys, Here's a draft proposal (http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming) for this project. Any comments will be appreciated :) Thanks, Tanmai On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna wrote: > Hi Hèctor, > A fundamental motivation for this proposal is

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Hi Hèctor, A fundamental motivation for this proposal is the possibility of giving the power to each program to use and propagate as much information as it needs in the pipeline. In our discussion on the IRC, Tino Didriksen said: > You should see how much secondary information VISL's streams

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Hèctor Alòs i Font
Hi Tanmai, I am surprised by this proposal. It involves some very important changes that should be better justified. I don't quite understand when should one define the "optional secondary information" in addition to the current morphological fields. Will it be in the language module

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Daniel Swanson
I think you could reasonably consider it consistent, just with primary information having an empty prefix, which makes sense, given that it is primary. On Sat, Mar 28, 2020 at 6:00 PM Scoop Gracie wrote: > Oh, okay, that makes sense. I was also thinking it might make it easier > for humans to

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Scoop Gracie
Oh, okay, that makes sense. I was also thinking it might make it easier for humans to debug the format. On Sat, Mar 28, 2020, 14:55 Tanmai Khanna wrote: > Scoopgracie, > We discussed something similar to this on the IRC, while doing that would > make things very consistent, it would become too

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Tanmai Khanna
Scoopgracie, We discussed something similar to this on the IRC. While doing that would make things very consistent, it would become too verbose, which is why it might be easier to not have the feature:value format for primary information, i.e., information that's almost always going to be there,

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Scoop Gracie
That sounds like a great idea to me. Maybe could even become ? On Sat, Mar 28, 2020, 13:51 Tanmai Khanna wrote: > Hey guys, > As part of the project to eliminate trimming, I had to come up with a way > to include the surface form in the lexical unit and hence modifying the > apertium stream

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Scoop Gracie
Or = On Sat, Mar 28, 2020, 13:58 Scoop Gracie wrote: > That sounds like a great idea to me. Maybe could even become ? > > On Sat, Mar 28, 2020, 13:51 Tanmai Khanna wrote: > >> Hey guys, >> As part of the project to eliminate trimming, I had to come up with a way >> to include the surface form

[Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Tanmai Khanna
Hey guys, As part of the project to eliminate trimming, I had to come up with a way to include the surface form in the lexical unit and hence modifying the apertium stream format. To do this I would have to modify the parsers of every program in the pipeline, and if that has to happen, we