It's getting more normal to use recurrent models that no longer have bounds
on their input and output sizes. This removes half the challenge of this
task. https://github.com/BlinkDL/RWKV-LM
so maybe
pip3 install git+https://github.com/xloem/GPTb
from GPTB import GPTBLMHeadModel
from transformers.models.gpt2.configuration_gpt2 import GPT2Config
config = GPT2Config()  # pass settings, or pull the config from some pretrained model and tweak it
config.rebias = True  # the additional rebias setting
i'm suspecting some people have been using fairseq for things like this
https://github.com/pytorch/fairseq
it's a facebook project focused on training sequence transformer models.
noticed there was a deep learning related repo on the old gitopia, too;
it could be worthwhile to look through things like that
regarding the idea for saving state, that could work here. basically you
take a fancy text generation model and finetune it to produce its own
embeddings by feeding it one token at a time instead of a document, each
time feeding back its generated state as embeddings. it then is possibly
bound by
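a rough sketch of that loop, assuming a huggingface gpt2 and using its
past_key_values cache as the fed-back state (not literal embeddings,
which would need the extra finetuning described above):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

state = None  # the fed-back state between steps
for token_id in tok("one token at a time")["input_ids"]:
    out = model(input_ids=torch.tensor([[token_id]]),
                past_key_values=state, use_cache=True)
    state = out.past_key_values  # reuse the generated state on the next step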
uhhh the discord i remember the best is eleutherai's. they made gptj
and also an open source coding assistant app for vscode.
Note: I won't be effective at using the cutting edge here, because I
am not hanging in research chats on discord collaborating with
researchers sharing their latest work. Anybody can do that by hopping
through the chat servers, asking around. It feels a little
overwhelming for me.
Another idea:
We could design something using human knowledge or ghidra, then review it
and figure out how a model could have designed it on its own.
I'm thinking I'd like to try training a bytes tokenizer for bigbird and
extending its sequence length to entire binaries. I expect the result to be
about 30% successful given my lack of experience and time.
idea: a model could be trained to guess the source layout by sequentially
producing filepaths and selecting areas of the source code to consider,
like an agent
that's similar to language generation except the output words/phrases are
unordered: a set of filepaths.
might be interesting to try
- I skimmed bigbird's description a little. it's trained for sequence
lengths of 4096 tokens, but it doesn't look like memory requirements would
rise too much if that were increased somehow. curious whether you can
finetune a model with increased position embeddings; probably you can
(see the position-embedding sketch after this list).
- I glanced at realm
- a large pretrained model that has significant understanding of
english logic and knowledge could be finetuned on bytes by training
perceiver-like cross attention embedding/tokenization encoders and
decoders to match the behaviors of its original tokenizer and
embeddings but accept byte streams.
- a large T5 model could be tpu compiled on colab notebooks by calling
pmap() on individual blocks rather than the whole model (see the pmap
sketch after this list)
- much larger models could be trained by masking the training weights
to reduce autograd memory load as has been done for at-home training
of large text generation models
- it turns out that deserialization of compiled tpu code isn't
implemented in colab notebooks yet. might be easy to implement, might
be nearly impossible, haven't looked. so not too much was
accomplished by the use of tpu vms other than realising they're there
for when a lot of speed is needed.
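re the bigbird position-embedding question above, here's a minimal
sketch of how the embeddings could be grown for finetuning, assuming the
model keeps them in a plain nn.Embedding (attribute paths differ per
model, so this is generic):

import torch

def grow_position_embeddings(old_emb: torch.nn.Embedding, new_len: int) -> torch.nn.Embedding:
    # copy the pretrained rows into a longer table; the new tail rows start random
    new_emb = torch.nn.Embedding(new_len, old_emb.embedding_dim)
    with torch.no_grad():
        new_emb.weight[: old_emb.num_embeddings] = old_emb.weight
    return new_emb

the new rows then get learned during finetuning while the old ones start
from their pretrained values.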
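and re the per-block pmap bullet, a toy sketch of the idea: compile each
block as its own pmap'd function and chain them in python, instead of
handing the whole model to one pmap call. the "block" here is obviously
not a real T5 block:

import jax
import jax.numpy as jnp

def block(params, x):
    # stand-in for one transformer block
    return jnp.tanh(x @ params)

n_dev = jax.local_device_count()
blocks = [jax.pmap(block) for _ in range(4)]             # one compiled callable per block
params = [jnp.ones((n_dev, 16, 16)) for _ in range(4)]   # per-device params for each block

x = jnp.ones((n_dev, 8, 16))
for p, f in zip(params, blocks):
    x = f(p, x)  # each block compiles on first use rather than the whole model at once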
this has been going slower than needed because colab was bailing when
i tried to run the model on google's tpus, during compilation.
today i made a google cloud vm and precompiled the model in their
shell, and added precompilation support to the notebook. it was
_really_ hard to make the vm, my
On 1/19/22, k wrote:
> decompiled function as of today:
> \00 def example_sum(left, right, sum):
> it doesn't look like much, but it's progress
There will be a party if your new ghidra prints
printf("Hello world.\n");
https://github.com/NationalSecurityAgency/ghidra
> might take me a bit to
this is currently autouploading new snapshots of the model training as
it goes, for as long as google lets my notebook stay running. it's
presently between 1.0 and 2.0 loss and is making decompilations that
don't have weird symbols in them. it's training on only a little
under 30k unreviewed and
a jax contributor kindly shared this with me. you can store tpu
models precompiled, which significantly speeds launch time, by using a
compilation cache folder.
from jax.experimental.compilation_cache import compilation_cache as cc
cc.initialize_cache("/path/name/here", max_cache_size_bytes=32 * 2**30)  # assuming the cut-off size was 32 GiB
um er
- i went back to that and it turned out i had just scrolled up, and
the training was all there
- i think i may have uploaded another snapshot
- i let it train for a number more hours, but when i returned the vm
had run out of ram and X wasn't accepting keyboard input. it took me
some time
- show and tell -
the checkpoint on huggingface currently has a loss of around 2.1, so
it doesn't succeed yet. but it turns out it can produce an output,
and guesses a simple signature correctly:
git clone https://github.com/xloem/techsketball
cd techsketball
python3 demo.py
it compiles a very
note: in my bumbling i found this doc which gives a general intro to
flax/jax/huggingface from google:
https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md
. i'm wondering if stuff like that doc is how jax reached me.
i posted but my email disappeared for me, here's another.
my continuing on this is waning atm, maybe will change.
the first model was lost after a day or two when the notebook closed itself.
reusing the token ids of the T5 tokenizer really speeds training from
the T5 model. i spent some time
note:
- additionally, the perceiver model structure may not need tokenization
- and, google made a new T5 called LongT5 that can handle much larger
data already; the code will likely be released in the coming months
given many functions are short, i might skip the length problem for now
but maybe now something
the T5 tokenizer the current code uses removes linebreaks, so the
output source isn't recompilable.
last night i added to find_pycode.py to add functions to train a
tokenizer for the source, preserving linebreaks.
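a rough sketch of training that kind of tokenizer with the huggingface
tokenizers library, assuming byte-level pretokenization so linebreaks
stay representable (the corpus filename is made up):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
# byte-level pretokenization keeps "\n" as something the vocabulary can express
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<pad>", "</s>", "<unk>"])
tok.train(["source_corpus.txt"], trainer)
tok.save("source_tokenizer.json")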
there is a further big issue: embedded strings are not tokenized on
the input
batchsize of 20 is about the same speed
redaction: this is not actually the free colab. to make it work on
the free colab, you'd drop the batch size so it fits in ram. while
frustrated with the tpu rpc timeouts i bought the paid colab. it
didn't help; it turns out the timeout is hardcoded
it's successfully fitting the model to the task on the colab gpu. the
tpu compilation times out colab's rpc connection to google's cloud.
the eta for 10 runs through my example data is within 520 hours (3
weeks) on the free colab gpu notebook using a batch size of 16.
hbm limits relate to the TPU linked to the notebook. a v2-8 (i
think?) has 64 GB which gets split into 8x 8GB if all 8 cores are
used. TRC provides larger TPUs, but it still raises the memory size
issue.
[missing change was committed]
t5-base with a batch size of 6 is looking for 22GB of hbm (tpu memory).
the crashes complain it only has 7 GB; might be a notebook limit or a
time of day thing
[after a number of psychotic breaks] the training loop runs now. it's
likely not running very effectively. for the notebook to run right
now, an uncommitted change is needed:
# compute loss
loss = optax.softmax_cross_entropy(
    logits, flax.training.common_utils.onehot(labels, logits.shape[-1])
).mean()  # assuming the vocab size comes from the logits' last axis
i've addressed bugs enough that it actually gets to the point where
the tpus evaluate the model with passed data.
so far the first evaluation pass hasn't returned, maybe because this
demo is low-end, unsure. i have no idea how long it should take and
should try a smaller model to continue
i'm looking at
https://github.com/huggingface/transformers/blob/master/examples/flax/summarization/run_summarization_flax.py#L534
, which is the flax summarization example, and noting that the
decoder input ids are the labels shifted by one. i'm thinking that
summarization is basically the
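the shift itself is small; a minimal numpy version of what "labels
shifted by one" means here (the start and pad token ids are arbitrary):

import numpy as np

def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    # decoder inputs are the labels moved one slot right, with a start token in front
    shifted = np.zeros_like(labels)
    shifted[:, 1:] = labels[:, :-1]
    shifted[:, 0] = decoder_start_token_id
    return np.where(shifted == -100, pad_token_id, shifted)  # -100 marks ignored label slots

labels = np.array([[42, 7, 13, -100]])
print(shift_right(labels))  # [[ 0 42  7 13]]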
wow those two emails are _full_ of errors. don't take the log of
logits, you'll get a double-log probability and nobody will know what
to do with it except people investigating the insides of neural
network models that manipulate other neural network models or
something
oh, and .view(-1, ...) means to squish an n-dimensional tensor so that
it has the dimension sizes listed, where -1 means to make that
dimension as large as needed to fit all the elements. so .view(-1)
turns it into a 1-dimensional array.
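for example:

import torch
x = torch.arange(12).reshape(3, 4)
print(x.view(-1).shape)     # torch.Size([12]): squashed to one dimension
print(x.view(-1, 2).shape)  # torch.Size([6, 2]): last dim fixed at 2, first inferred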
so, the jax/flax hugging face t5 output doesn't include loss the way
the huggingface t5 documentation implies. the pytorch output does.
here's the loss from the huggingface pytorch t5 code. for me this is
line 1643 of my old checkout of github.com/huggingface/transformers
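roughly, it computes a flat cross entropy over the vocabulary.
reconstructed from memory, so exact lines may differ by version; the
toy lm_logits/labels here are just so it runs:

import torch
from torch.nn import CrossEntropyLoss

lm_logits = torch.randn(2, 5, 32128)      # (batch, seq, vocab)
labels = torch.randint(0, 32128, (2, 5))  # (batch, seq)
loss_fct = CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))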
looks like i pasted together data batching code that doesn't line up.
basically the code needs to be mutated such that each batch is a dict,
rather than the whole data. the example at
https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_t5_mlm_flax.py
uses
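the shape it needs is basically a dict per batch; a minimal sketch (key
names follow that example, the data arrays are placeholders):

import numpy as np

def batches(input_ids, labels, batch_size):
    # yield one dict per batch instead of one dict holding the whole dataset
    for start in range(0, len(input_ids), batch_size):
        yield {
            "input_ids": np.asarray(input_ids[start:start + batch_size]),
            "labels": np.asarray(labels[start:start + batch_size]),
        }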
I've pasted a training function into the .ipynb in
https://github.com/xloem/techsketball/ . it's not mutated into a .py
yet.
i've also added mmapping functionality to the data generator so data
larger than ram can be used and cached between tests. it is not used
yet.
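the mmapping piece is essentially numpy's memmap; a rough sketch of the
pattern, not the actual generator code (filenames and shapes are made up):

import numpy as np

# first run: tokenized examples get written to a disk cache instead of held in ram
cache = np.memmap("tokens.cache", dtype=np.int32, mode="w+", shape=(100000, 512))
cache[0] = np.arange(512)  # fill rows as examples are tokenized
cache.flush()

# later runs: reopen read-only; larger-than-ram data is paged in on demand
cache = np.memmap("tokens.cache", dtype=np.int32, mode="r", shape=(100000, 512))
batch = np.array(cache[0:16])  # copy only the rows the batch needs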
i code with dense bugs due to
The linked function will need to be mutated for T5 per the T5 page linked
earlier in this thread and farther down in my repo readme. Or the page's
instructions could simply be used, rather than this TPU-oriented tutorial.
i got some simple data prepared and into the training implementation
but have not written it, hard to continue.
i'm at
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/causal_language_modeling_flax.ipynb#scrollTo=GjKzb0zJd-aH
code is
On 12/29/21, Punk-BatSoup-Stasi 2.0 wrote:
> On Wed, 29 Dec 2021 17:44:57 -0500
> k wrote:
>
>> i think this example notebook shows training a transformer model on
>> the free tpus https://colab.research.google.com
>
>
> again, fuck you karl and your fucking JOOGLE SPAM. Take it
yeah i dunno =/
but hey, big corps guiding advanced tech to use big computing
resources and then monopolising control of them is just like how we
spam lists to do things, maybe! use whatcha got?
gotta figure out how to turn the problem into a different solution
i think this example notebook shows training a transformer model on
the free tpus
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/causal_language_modeling_flax.ipynb
the trick is to get how google's military contract is fueled by the
terrorist propaganda it researches dispensing actually targeting the
problems up there
yay progress! time for me to spin in circles a bit. [crazy]
i wrote a quick call-and-go class to generate short pairs of bytecode
and sourcecode from the python runtime at
https://github.com/xloem/techsketball/blob/main/find_pycode.py
it might be reasonable to use this as a proof of concept, filtering on
input length. since others are likely adding the
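the underlying trick is all standard library; a minimal example of
getting one (bytecode, source) pair, not the actual find_pycode.py code
(run it as a script so inspect can see the source):

import dis
import inspect

def example_sum(left, right):
    return left + right

source = inspect.getsource(example_sum)   # the source text
bytecode = example_sum.__code__.co_code   # the raw bytecode bytes
print(source)
print(bytecode.hex())
print(dis.Bytecode(example_sum).dis())    # human-readable disassembly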
i summarized some things at
https://github.com/xloem/techsketball/blob/main/README.md
including a link to that memory reducing paper at
https://arxiv.org/abs/2112.05682 and some python import statements.
there's code for this paper at
https://github.com/AminRezaei0x443/memory-efficient-attention
ok, this was great motion
i think vanilla models have a maximum sequence length. this can be
expanded by altering the algorithm to not be O(n^2) for memory in the
attention function. there's a paper out there on one approach to
this.
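a toy sketch of the general idea: process queries in chunks so the full
n x n score matrix never exists at once. the paper's actual algorithm
also chunks the keys/values and renormalizes, so this is just the flavor:

import numpy as np

def chunked_attention(q, k, v, chunk=128):
    # q, k, v: (seq, dim); only a (chunk, seq) slice of scores is live at a time
    out = np.empty_like(q)
    for start in range(0, q.shape[0], chunk):
        scores = q[start:start + chunk] @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + chunk] = weights @ v
    return out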
another idea is to chunk the text in some way and train
note: the huggingface demo passes information to the model using token ids.
token ids are just indices into a set of character sequences that occur
together frequently (the tokenizer counts and decides these)
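for example, with a t5 tokenizer:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
ids = tok("def example_sum(left, right):").input_ids
print(ids)                             # a short list of integers, one per learned chunk
print(tok.convert_ids_to_tokens(ids))  # the character chunks those ids stand for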
with something based on math, since it's going to be learning using
linear algebra, i'm wondering
The tutorial looks not that great. I'm using google colab notebooks
now to play on google's machines at https://colab.research.google.com/
and reading about the T5 transformer model which was the basis for the
latest big free model, and is commonly used for translation:
This project provides a normative way to train on data _without_
storing it locally which could make it much simpler to use google's
TPU's: https://github.com/activeloopai/Hub
The reason I picked a TPU, google cloud oriented tutorial, is because
google has a 1 month research program for access to higher level tpus.
So, if something basic is set up, then it can be upgraded for free for
a month.
Google cloud sdk is downloading at 20k/sec for me, so I'm thinking a
good
Obviously it's everybody's duty to build this once you believe it's possible.
I found this tutorial for finetuning a language model used for
translation: https://pythonrepo.com/repo/gsarti-t5-flax-gcp
I made this empty github repository that could hold some attempts: