untested, exfiltrated from the old ipad I wrote it on via qr code after
multiple sudden reboots
diff --git a/datagen/process2.cpp b/datagen/process2.cpp
index 1e59d6e..b8388ff 100644
--- a/datagen/process2.cpp
+++ b/datagen/process2.cpp
@@ -49,6 +49,20 @@ struct oid_pair_hash
}
};
So, the thing to do here is apparently to use a language adapter. These mutate
embeddings intended for other models such that minimal training is needed.
If training one's own tokenizers, it would make sense to reduce the vocab size
so there are fewer embeddings, but you could just use a
second send unexpected, likely from finger spasm
i think one of the inhibitions around speed is that if i implement
distributed training, it could be a helpful experience. there are other
solutions to speed, but it's psychologically harder for me to learn
the processes of training one of these models. distributed training
would help build
# groomed/trained tokenizer and embeddings
well i put the tokenizer in with the embeddings code and it seems to
work fine but the embeddings are a couple orders of magnitude larger
than the adapter, and very slow to train. i did not copy the
embeddings from the previous tokenizer (a happenstance
i mis-saw; it's not off by one.
new tokenizer has 9 special tokens.
old tokenizer has 100.
should fill out with more vocab.
my extended trained tokenizer has, for some reason, 1 fewer vocab word than the
tokenizer used for the model the past few days.
i get to debug that!
the tokenizer loading code usually loads from a folder
it looks like the tokenizer files don't clash with the model files,
and so the same folder can be used for both
that's helpful
at the same time, i'll want to figure out how to make the scripts use
the tokenizer i trained. they're still using the extended vanilla
tokenizer. i actually trained a tokenizer on source code.
when the model is saved, i think it actually saves the tokenizer alongside it,
so i would just want to make sure i am loading the tokenizer saved
with the model
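a minimal sketch of that; the checkpoint path is a placeholder, not the real one:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "output/checkpoint-latest"  # hypothetical save directory written by the trainer
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)  # picks up the tokenizer files saved alongside the model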
i burned some time figuring out the largest number of patch tokens i
could generate with the present config on colab, which let me have a
gpu again today. the number was 2168.
given colab gives me bigger gpus, it seemed to make sense to invest a
little bit in figuring out how to use them better
another to-do when pulling from many repositories is to balance the
languages included, so there are the same number of each
i know there is a way of handling it when that issue is present, but i do
not know what it is, and it seems to me that balancing the set the
model is built off of would be the most
oh whoops that result has only 30 entries!
gathering data from these could be aided by using git's support for
object filtering, which lets git be used to access a remote repository
without downloading all the objects
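a sketch of what that could look like; the repository url is a placeholder, and it assumes a reasonably recent git with partial clone support:

import subprocess

url = "https://github.com/searx/searx"  # placeholder repository
subprocess.run(["git", "clone", "--filter=blob:none", "--no-checkout", url, "repo"], check=True)
# commits and trees come down; blob objects are only fetched lazily when something asks for them,
# e.g. git cat-file or a checkout of specific paths
log = subprocess.run(["git", "-C", "repo", "log", "--oneline", "-5"],
                     capture_output=True, text=True, check=True)
print(log.stdout)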
oops, i overwrote the file before it sent and totally confused magic-wormhole
whelp
i'm having a little trouble (including funny anomalies) with
web3.storage and ipfs transferring the data generated on my own system
(asciinema recording of one attempt at
https://asciinema.org/a/YTPm9RYdmnkpvItRELRaoKunF ), so i'm planning
to use magic-wormhole instead, but magic-wormhole doesn't
i generated some data locally; it's in one of the *.w3put.log files in
the repo (i also trained a tokenizer).
unfortunately, with the long input lengths, it seems i keep crashing
colab while loading and tokenizing it.
i haven't caught the error happening yet so i'm not certain.
when it happens, the
some trouble retaining my thoughts
- the loss is reducing again (it's at 1.10-1.14)
- i'm training a tokenizer, but don't have code to save it if the vm
times out before i use it. it looks like it will take 3-4 hours just
to preprocess this data
ok i should move the tokenizer training to a system
i've drafted some code for training embeddings.
i think i'd like to let it groom for a bit before enabling it, because
of the time it spent with the very short length that only included the
header, trying to kind of undo that without adding too many variables
to the model shape.
oh! and it uses the extended tokenizer made by the extend_tokenizer.py script
note: the latest model url is in the w3put.log file in the repository.
it can be out of date due to bugs and latency, so maybe check date
information if there's a question
the model with the funky tokenizer is starting to look useful, as if it
could produce code on its own. i have not actually seen it do that yet.
[it's hard handling how it's starting to look useful] bugs and
personal issues prevent checking how useful it actually is.
- i accidentally dropped the output
this is where i left off with process.cpp
given that i'll need to figure out how to implement (or otherwise
handle) diffing files, possibly including moving them (might make
sense to use an existing git library),
it seems a better investment of time would be setting up adapters to
train embeddings.
time to sleep
who knows what different or similar behaviors may exist tomorrow
speed drops from 140 MB/s to 4 MB/s .
somewhat surprising. i cut corners for clarity and all, and i guess
that mattered some.
of course, catting is different from parsing, too.
still definitely speedy enough.
this is bare bones c++ code to parse the output of git cat-file
--batch --batch-all-objects
it doesn't _do_ anything with its parsed data. it parrots tree and
commit object fields to stdout.
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <cstdint>
#include <cstring>
using namespace std;
struct RawObject
{
git cat-file --batch --batch-all-objects outputs a pretty simple
format at about 50MB/s . much faster.
raw commit objects are pretty clear. raw blob objects are just data.
raw tree objects appear to have a binary format: however it is pretty
simple, just a list of <ascii mode><space><0-terminated name><20 raw sha1 bytes> entries.
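a rough python sketch of walking one raw tree payload under that assumption (the mode is variable-length ascii before the space, not padded to a fixed width):

def parse_tree(raw: bytes):
    # raw is the tree payload that follows the "<oid> tree <size>" header line of cat-file --batch
    i = 0
    while i < len(raw):
        space = raw.index(b" ", i)
        nul = raw.index(b"\0", space)
        mode = raw[i:space].decode()               # e.g. "100644" for a blob, "40000" for a subtree
        name = raw[space + 1:nul].decode(errors="replace")
        sha1 = raw[nul + 1:nul + 21].hex()         # stored as 20 raw bytes, not hex
        yield mode, name, sha1
        i = nul + 21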
might make more sense to just dump the git objects, i thought of that
after i had mostly written it; dunno. less looking-things-up this way
i suppose. a little slow, but faster than what i currently have.
this bash script outputs at 100-200 KB/s after taking some time
getting going, on my local system
it outputs in an ascii format that would be easy to parse in c++ with
std::cin or in python with .split(' ')
generate.bash
i guess those could be represented with a merge file, like with the
<<<<<<< ======= >>>>>>> conflict markers or any merge format, and then a diff of the
resulting file with that.
seems that would best be done not in a bash script
thinking of merge commits :D these have two parent trees
so basically, each commit has a tree preceding it.
the tree is composed of git objects that can be listed with `git ls-tree`
the format i currently have
-
another possible issue with the current adapter thing is that the
tokenizer uses raw spaces. usually these models replace spaces with a metasymbol
making a fast data generation script would help my adapter tuning too
here is a github search for AGPL repositories less than 100 MB large
created before 2015:
https://github.com/search?l==desc=created%3A%3C2015+stars%3A%3E100+size%3A%3C10+license%3Aagpl-3.0=stars=Repositories
the top-starred one in python is searx
my desire around the idea is to see if any data scientists would be
interested in processing AGPL-only code. i'd try to use the rareness
of the idea to try to support free software, and if it turned out it
wasn't rare then i could benefit from prior work, or compromise.
so a bash script would just run git and pipe the output to a program
that processes it and produces say a jsonlines format that data
scientists are used to accepting
oh i could just read stdin
to speed up the data generator and make a mass of data, it would make
sense to combine hist.bash with hist2json.py, and to generate the json
manually rather than using a general lib, since the format is very
simple.
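a sketch of the hand-rolled jsonlines idea; the field names are made up here, and json.dumps is still used per string so quotes and linebreaks in code stay escaped correctly:

import json
import sys

def emit_record(out, commit_message, file_diff):
    # one object per line; only string escaping is delegated to json.dumps
    out.write('{"input": ' + json.dumps(commit_message)
              + ', "output": ' + json.dumps(file_diff) + '}\n')

emit_record(sys.stdout, "fix parser", "@@ -1 +1 @@\n-i = 0\n+i = 1\n")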
this could also be done in a compiled language, although i presently
remember
[my current data generator is quite slow, unfortunately; but it's
better than nothing]
another thing that would make sense here would be to just generate a
bunch of data, and then share it with somebody interested in making a
model that handles it
adapters do indeed support training embeddings. it is a parameter
passed when enabling adapter training:
https://github.com/adapter-hub/adapter-transformers/pull/245/files#diff-b31f98a320a05bd7744546d866cb04c4ac086ffae583745b969093c17d5cde6dR205
it looks like the trained embeddings are not then
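in any case, a hedged sketch of turning that on with adapter-transformers, assuming the train_embeddings flag from the linked PR and a build where longt5 has adapter support (e.g. the fork mentioned further down); the adapter name is a placeholder:

from transformers.adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("google/long-t5-tglobal-base")
model.add_adapter("codefudge")
# freeze the base weights, make the adapter trainable, and also unfreeze the embedding matrix
model.train_adapter("codefudge", train_embeddings=True)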
I bundled up some inputs (poorly), and extended the tokenizers to process
symbols found in code, and continued with the same adapter even though those
things had changed.
The loss isn't dropping very fast anymore. There's also a bug I'm running into,
where the code I did not write is just
i'm having trouble continuing to work on this, but i seem able to let
it train for a bit
maybe i can try to keep it training
i'm thinking a little of trying the other model types, too, dunno
what i thought would be good to do was bundling the commit files up
together into the input stream, so
special tokens in these transformers:
key: eos=end-of-stream, bos=beginning-of-stream
longt5 eos_token='</s>' unk_token='<unk>' pad_token='<pad>' note:
longt5 says it uses pad as bos
xlnet bos_token='<s>' eos_token='</s>' unk_token='<unk>'
sep_token='<sep>' pad_token='<pad>' cls_token='<cls>'
transxl eos_token='<eos>' unk_token='<unk>'
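these are easy to re-check; a quick sketch printing them from the stock hub tokenizers:

from transformers import AutoTokenizer

for name in ("google/long-t5-tglobal-base", "xlnet-base-cased", "transfo-xl-wt103"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.special_tokens_map)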
having trouble focusing on combining the files into input. probably
hesitating due to lack of knowledge of how much input the vm's gpu ram
can hold.
makes sense to separate the file data from the commit message data, so
that an arbitrary number of files can be included
similarly, it would be
i ran it again until colab logged me out. the loss dropped to 0.7 or so.
apparently colab lets you have a gpu again if you disable background execution.
i'm running it some more, just for effective use of time.
i looked into how longt5 works, and basically it locally
contextualises regions of the input; the transient-global variant also
lets every token attend to per-block summary tokens computed on the fly
insufficient explanation
the final trained adapter is only a few megabytes large, despite the
model being several gigabytes.
a nagging part of me keeps considering the pretraining content of the
model. i'm not sure what t5 models are trained on, but i imagine a
generative or masked model would have more general-purpose
so basically codefudge is some scripts to train an adapter based on the
history of a git repository.
hist.bash breaks the git history into *.file files containing commit
message and file content, and *.commit files that contain the file
diff within the commit. at time of writing, each file changed
xlnet seems a more normative way to do this than longt5. notably the
longt5 tokenizer doesn't include tokens for linebreaks.
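a hedged sketch of extending a tokenizer that way (roughly what extend_tokenizer.py is described as doing elsewhere in these notes); the symbol list is made up, and whitespace tokens need normalized=False so they survive normalization:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tokenizers import AddedToken

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
tokenizer.add_tokens([AddedToken("\n", normalized=False), "{", "}", "->", "::"])
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
model.resize_token_embeddings(len(tokenizer))  # the new rows start randomly initialised and need training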
https://github.com/xloem/codefudge
# CodeFudge
## Ingredients:
1. one git repository of your choosing, in the same parent folder as this one
2. bash, python, pytorch,
below is as far as i got.
when i tried running it on xsum, i ran out of gpu ram. i imagine there
are a number of ways to address that.
i'm thinking the simplest approach might be to use a smaller model, even
if it doesn't have the long context support.
another idea is to produce a shorter set of
the readme says that the first column is the input, and the second
column is the output. sounds reasonable.
the script can also extract specific columns from csv or jsonlines input.
a normal thing to do here (assuming the code works, i could test on
one of their examples), might be to train a
it's surprising that i did this!
i did not review the implementation structures to see if i missed
anything, once i got it to run
i also did not implement parallelism, tests, nor documentation
during adapter training gpu ram usage was under 1700 MB
i actually did this. added adapters to longt5 in my repo branch, and
successfully trained on a csv file consisting of lines equal to "a,b".
after 400 data items, loss dropped to around 0.6, eval loss was 0.0
i don't really know if it's _working_ or just looks like it is or
something, nor how it
it turns out adapter-transformers actually contains an entire modified copy
of the transformers repository inside it, which replaces transformers on the
user's system. they update this regularly, but their current version
does not have longt5.
i added longt5 and forked the repo and pushed it. it doesn't work with
long-t5-tglobal-base is using 1539 MB nvidia GPU ram for me with float32
weights. I unloaded my desktop environment and it would perform simple forward
inference for me.
there appears to be useful information on expected ram requirements at
https://huggingface.co/docs/transformers/v4.16.2/en/performance#model-weights
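the rule of thumb from that page, as a back-of-envelope check; the ~250M parameter count for long-t5-tglobal-base is an assumption, not a measured number:

params = 250e6                      # assumed parameter count
mib = 2**20
weights = params * 4 / mib          # fp32 weights: 4 bytes per parameter
grads = params * 4 / mib            # gradients, training only
adam = params * 8 / mib             # adam moment buffers, training only
print(f"weights ~{weights:.0f} MiB; full training adds ~{grads + adam:.0f} MiB more, before activations")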
not enough gpu ram
cool to learn about though
the example code doesn't work, but the model loads when i put it in a
normal huggingface 'pipeline' class
it appears to do forward inference on my cpu, haven't figured out how
to put it on the gpu
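if it's the huggingface pipeline class, moving it to the gpu is just the device argument; the task and model id here are guesses at what was being run:

from transformers import pipeline

pipe = pipeline("text2text-generation", model="google/long-t5-tglobal-base", device=0)  # device=0 -> first cuda gpu
print(pipe("summarize: some long input text"))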
something i could theoretically try could be to manually process the
weights to reduce their precision
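the simplest version of that idea is probably just casting to fp16 before moving to the gpu; a sketch, assuming the fp32 model fits in system ram first:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
model = model.half().to("cuda")   # roughly halves the weight memory vs float32
model.eval()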
the model is 1G large
i'm suspecting this implies it won't fit in my gpu ram here
i had to install the transformers package from git in order to use LongT5Model
that's likely something like pip3 install --upgrade
git+https://github.com/huggingface/transformers
it looks like this is the smallest of the most official longt5 models:
https://huggingface.co/google/long-t5-tglobal-base
long story short, from https://arxiv.org/pdf/2112.07916.pdf ,
transient global attention is a new approach to attention invented for
longt5, which appears to reliably outperform local attention in the
same architecture
the appearance of that, combined with finetunings i saw on the
huggingface hub
i am risking my focus in order to spend a little time learning what
the difference might be between local attention and transient global
attention in longt5
as a side note, it is notable that it looks a little like people
are using longt5 heavily.
i ended up searching huggingface's github for closed issues that
mentioned 'transformer-xl'
it showed me results for other long context transformers than just that one.
the most recent one it showed as completed, i think, was longt5.
i didn't find transformer-xl models in the huggingface hub; it was
i seem to be a little interested now in discerning if transformer xl
will run on my local system
there's a doc page at
https://huggingface.co/docs/transformers/model_doc/transfo-xl . it
starts off with this example:
from transformers import TransfoXLConfig, TransfoXLModel
# Initializing a Transformer-XL configuration, and a model from that configuration
configuration = TransfoXLConfig()
model = TransfoXLModel(configuration)