Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-29 Thread Undiscussed Past Horrific Abuse, One Victim Of Many
untested, exfiltrated from the old ipad I wrote it on via qr code after multiple sudden reboots

diff --git a/datagen/process2.cpp b/datagen/process2.cpp
index 1e59d6e..b8388ff 100644
--- a/datagen/process2.cpp
+++ b/datagen/process2.cpp
@@ -49,6 +49,20 @@ struct oid_pair_hash
     }
 };

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-29 Thread Undiscussed Past Horrific Abuse, One Victim Of Many
So, the thing to do here is apparently to use a language adapter. These mutate embeddings intended for other models such that minimal training is needed. If training one's own tokenizer, it would make sense to reduce the vocab size so there are fewer embeddings, but you could just use a

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-28 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
second send unexpected, likely from finger spasm

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-28 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i think one of the inhibitions around speed is that if i implement distributed training it could be a helpful experience. there are other solutions to speed, but it's harder psychologically for me to learn the processes of training one of these models. distributed training would help build

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-28 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
# groomed/trained tokenizer and embeddings
well, i put the tokenizer in with the embeddings code and it seems to work fine, but the embeddings are a couple orders of magnitude larger than the adapter, and very slow to train. i did not copy the embeddings from the previous tokenizer (a happenstance

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
mis-saw. not off by one. new tokenizer has 9 special tokens. old tokenizer has 100. should fill out with more vocab.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
my extended trained tokenizer has, for some reason, 1 fewer vocab words than the tokenizer used for the model the past few days. get to debug that!

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
the tokenizer loading code usually loads from a folder. it looks like the tokenizer files don't clash with the model files, and so the two folders are used for both. that's helpful

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
at the same time, i'll want to figure out how to make the scripts use the tokenizer i trained. they're still using the extended vanilla tokenizer. i actually trained a tokenizer on source code.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
when the model is saved, i think it actually saves the tokenizer alongside it, so i would just want to make sure i am loading the tokenizer saved with the model

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i burned some time figuring out the longest number of patch tokens i could generate with the present config on colab, which let me have a gpu again today. the number was 2168. given colab gives me bigger gpus, it seemed to make sense to invest a little bit in figuring out how to use them better

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
another to-do when pulling from many repositories is to balance the languages included, so there are the same number. i know there is a way of handling when that issue is present, but i do not know what it is, and it seems to me that balancing the set the model is built off of would be the most
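One simple way to get the same number of samples per language is to downsample every language to the size of the smallest one. The sketch below assumes the data is a list of (language, item) pairs; that shape, the function name, and the choice of downsampling (rather than whatever established method the message alludes to) are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Any, Iterable, List, Tuple


def balance_by_language(
    items: Iterable[Tuple[str, Any]], seed: int = 0
) -> List[Tuple[str, Any]]:
    """Downsample so every language contributes the same number of items."""
    groups = defaultdict(list)
    for lang, item in items:
        groups[lang].append(item)
    if not groups:
        return []
    # The smallest language sets the per-language quota.
    quota = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # seeded so the selection is reproducible
    balanced = []
    for lang in sorted(groups):
        for item in rng.sample(groups[lang], quota):
            balanced.append((lang, item))
    return balanced


# Hypothetical file names, purely for demonstration.
sample = [("python", "a.py"), ("python", "b.py"), ("python", "c.py"),
          ("cpp", "x.cpp"), ("cpp", "y.cpp")]
balanced = balance_by_language(sample)
```

Downsampling throws data away, so it only makes sense when the smallest language still has enough samples to be useful.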

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
oh whoops that result has only 30 entries!

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-26 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
gathering data from these could be aided by using git's support for object filtering, which lets git be used to access a remote repository without downloading all the objects
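Git exposes object filtering through partial clone. A minimal sketch, assuming the remote server supports partial clone: `--filter=blob:none` defers downloading file contents until they are needed, and `--filter=tree:0` also defers trees. The helper names, URL, and destination below are hypothetical.

```python
import subprocess
from typing import List


def partial_clone_cmd(url: str, dest: str, blobless: bool = True) -> List[str]:
    """Build a git command line for a partial (filtered) clone."""
    # blob:none keeps commits and trees but fetches blobs lazily;
    # tree:0 keeps only commits and fetches trees and blobs lazily.
    filter_spec = "blob:none" if blobless else "tree:0"
    return ["git", "clone", "--filter=" + filter_spec, "--no-checkout", url, dest]


def partial_clone(url: str, dest: str, blobless: bool = True) -> None:
    """Run the clone; raises CalledProcessError if git fails."""
    subprocess.run(partial_clone_cmd(url, dest, blobless), check=True)


# Example command only; nothing is executed here.
cmd = partial_clone_cmd("https://example.com/repo.git", "repo")
```

With a blobless clone, commit messages and tree listings are available right away, while file contents are fetched on demand when something reads them.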

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
oops, i overwrote the file before it sent and totally confused magic-wormhole. whelp

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i'm having a little trouble (including funny anomalies) with web3.storage and ipfs transferring the data generated on my own system (asciinema recording of one attempt at https://asciinema.org/a/YTPm9RYdmnkpvItRELRaoKunF ), so i'm planning to use magic-wormhole instead, but magic-wormhole doesn't

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i generated some data locally; it's in one of the *.w3put.log files in the repo (i also trained a tokenizer). unfortunately with the long input lengths, it seems i keep crashing colab loading and tokenizing it. i haven't caught the error happening yet so i'm not certain. when it happens, the

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
some trouble retaining my thoughts - the loss is reducing again (it's at 1.10-1.14) - i'm training a tokenizer, but don't have code to save it if the vm times out before i use it. it looks like it will take 3-4 hours just to preprocess this data. ok, i should move the tokenizer training to a system

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i've drafted some code for training embeddings. i think i'd like to let it groom for a bit before enabling it, because of the time it spent with the very short length that only included the header, trying to kind of undo that without adding too many variables to the model shape.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
oh! and it uses the extended tokenizer made by the extend_tokenizer.py script

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
note: the latest model url is in the w3put.log file in the repository. it can be out of date due to bugs and latency, so check date information maybe if there's a question

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
the model with the funky tokenizer is starting to look useful, as if it produced code on its own. i have not actually seen it do that yet. [it's hard handling how it's starting to look useful] bugs and personal issues prevent checking how useful it actually is. - i accidentally dropped the output

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-25 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
this is where i left off process.cpp. given that i'll need to figure out how to implement (or otherwise handle) diffing files, possibly including moving them (might make sense to use an existing git library), it seems a better investment of time would be setting up adapters to train embeddings.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
time to sleep who knows what different or similar behaviors may exist tomorrow

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
speed drops from 140 MB/s to 4 MB/s . somewhat surprising. i cut corners for clarity and all, and i guess that mattered some. of course, catting is different from parsing, too. still definitely speedy enough.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
this is bare bones c++ code to parse the output of git cat-file --batch --batch-all-objects it doesn't _do_ anything with its parsed data. it parrots tree and commit object fields to stdout. #include #include #include #include #include #include using namespace std; struct RawObject {
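For comparison, the framing of the `git cat-file --batch` stream can be sketched in a few lines of Python. Each object arrives as a header line `<oid> <type> <size>`, followed by exactly `<size>` bytes of content and one trailing newline. This is not a port of the C++ code in the message; the function name is hypothetical and the fake stream stands in for a real git pipe.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_batch(stream: BinaryIO) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (oid, type, content) for each object in a cat-file --batch stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        oid, kind, size = header.decode("ascii").split()
        body = stream.read(int(size))
        stream.read(1)  # consume the newline that follows the content
        yield oid, kind, body


# Fake stream with made-up object ids, just to exercise the framing.
fake = io.BytesIO(b"aaaa blob 5\nhello\nbbbb commit 3\nx\ny\n")
objects = list(read_batch(fake))
```

Against a real repository, the same function could read from the stdout pipe of `subprocess.Popen(["git", "cat-file", "--batch", "--batch-all-objects"], stdout=subprocess.PIPE)`.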

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
git cat-file --batch --batch-all-objects outputs a pretty simple format at about 50MB/s . much faster. raw commit objects are pretty clear. raw blob objects are just data. raw tree objects appear to have a binary format: however it is pretty simple, just a list of <6 byte ascii mode><0-terminated
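A sketch of parsing one raw tree object, assuming SHA-1 repositories where each entry is an ASCII octal mode, a space, a NUL-terminated name, and a 20-byte binary object id. The mode is not always 6 bytes (directories are stored as `40000`), so this version splits on the space instead of assuming a fixed width. The type and function names are hypothetical.

```python
from typing import Iterator, NamedTuple


class TreeEntry(NamedTuple):
    mode: str    # ASCII octal mode, e.g. "100644" or "40000"
    name: bytes  # raw path component; not guaranteed to be valid UTF-8
    oid: str     # object id as hex


def parse_tree(data: bytes, oid_len: int = 20) -> Iterator[TreeEntry]:
    """Yield the entries of a raw git tree object body."""
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)       # end of the mode field
        nul = data.index(b"\0", space + 1)  # end of the name field
        raw_oid = data[nul + 1:nul + 1 + oid_len]
        if len(raw_oid) != oid_len:
            raise ValueError("truncated tree entry")
        yield TreeEntry(data[pos:space].decode("ascii"),
                        data[space + 1:nul],
                        raw_oid.hex())
        pos = nul + 1 + oid_len


# Hand-built tree body with made-up object ids.
raw = (b"100644 a.txt\0" + bytes(range(20)) +
       b"40000 dir\0" + bytes([0xFF] * 20))
entries = list(parse_tree(raw))
```

Because the object id is binary and may itself contain space or NUL bytes, the parser skips it by length and never searches inside it.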

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
might make more sense to just dump the git objects, i thought of that after i had mostly written it; dunno. less looking-things-up this way i suppose. a little slow, but faster than what i currently have.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
this bash script outputs at 100-200 KB/s after taking some time getting going, on my local system it outputs in an ascii format that would be easy to parse in c++ with std::cin or in python with .split(' ') generate.bash Description: Binary data

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i guess those could be represented with a merge file, like with the <<< === or any merge format, and then a diff of the resulting file with that. seems that would best be done not in a bash script

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
thinking of merge commits :D these have two parent trees

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
so basically, each commit has a tree preceding it. the tree is composed of git objects that can be listed with `git ls-tree the format i currently have - another possible issue with the current adapter thing is that um the tokenizer uses raw spaces. usually these models replace spaces with a

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
making a fast data generation script would help my adapter tuning too

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
here is a github search for AGPL repositories less than 100 MB large created before 2015: https://github.com/search?l==desc=created%3A%3C2015+stars%3A%3E100+size%3A%3C10+license%3Aagpl-3.0=stars=Repositories the top-starred one in python is searx

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
my desire around the idea is to see if any data scientists would be interested in processing AGPL-only code. i'd try to use the rareness of the idea to try to support free software, and if it turned out it wasn't rare then i could benefit from prior work, or compromise.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
so a bash script would just run git and pipe the output to a program that processes it and produces say a jsonlines format that data scientists are used to accepting

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
oh i could just read stdin

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
to speed up the data generator and make a mass of data, it would make sense to combine hist.bash with hist2json.py, and to generate the json manually rather than using a general lib, since the format is very simple. this could also be done in a compiled language, although i presently remember
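Writing the JSON by hand mostly comes down to escaping strings correctly. A sketch of that under stated assumptions: the function names and the `input`/`label` field names are hypothetical, all values are strings, the text contains no lone surrogates, and the output is written as UTF-8. It escapes quotes, backslashes, and control characters below 0x20, and passes everything else through unchanged.

```python
_SHORT_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def json_string(text: str) -> str:
    """Encode text as a JSON string literal without using the json module."""
    parts = ['"']
    for ch in text:
        if ch in _SHORT_ESCAPES:
            parts.append(_SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            parts.append("\\u%04x" % ord(ch))  # remaining control characters
        else:
            parts.append(ch)
    parts.append('"')
    return "".join(parts)


def json_line(fields: dict) -> str:
    """Encode a flat dict of str -> str as one jsonlines record."""
    body = ", ".join(json_string(k) + ": " + json_string(v)
                     for k, v in fields.items())
    return "{" + body + "}\n"


line = json_line({"input": "x\ny", "label": "fix \"quotes\""})
```

Since newlines inside values are escaped, each record stays on a single physical line, which is what jsonlines readers expect.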

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
[my current data generator is quite slow, unfortunately; but it's better than nothing]

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
another thing that would make sense here would be to just generate a bunch of data, and then share it with somebody interested in making a model that handles it

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
adapters do indeed support training embeddings. it is a parameter passed when enabling adapter training: https://github.com/adapter-hub/adapter-transformers/pull/245/files#diff-b31f98a320a05bd7744546d866cb04c4ac086ffae583745b969093c17d5cde6dR205 it looks like the trained embeddings are not then

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-24 Thread Undiscussed Past Horrific Abuse, One Victim Of Many
I bundled up some inputs (poorly), and extended the tokenizers to process symbols found in code, and continued with the same adapter even though those things had changed. The loss isn't dropping very fast anymore. There's also a bug I'm running into, where the code I did not write is just

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i'm having trouble continuing to work on this, but i seem able to let it train for a bit. maybe i can try to keep it training. i'm thinking a little of trying the other model types, too, dunno. what i thought would be good to do was bundling the commit files up together into the input stream, so

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
special tokens in these transformers:
key: eos=end-of-stream, bos=beginning-of-stream
longt5 eos_token='</s>' unk_token='<unk>' pad_token='<pad>' note: longt5 says it uses pad as bos
xlnet bos_token='<s>' eos_token='</s>' unk_token='<unk>' sep_token='<sep>' pad_token='<pad>' cls_token='<cls>'
transxl eos_token='<eos>' unk_token='<unk>'

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
having trouble focusing on combining the files into input. probably hesitating due to lack of knowledge of how much input the vm's gpu ram can hold. makes sense to separate the file data from the commit message data, so that an arbitrary number of files can be included. similarly, it would be

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Groomed for Male Slavery, One Victim of Many
i ran it again until colab logged me out. the loss dropped to 0.7 or so. apparently colab lets you have a gpu again if you disable background execution. i'm running it some more, just for effective use of time. i looked into how longt5 works, and basically it locally contextualises the regions

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Horrific Abuse, One Victim of Many
insufficient explanation

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Horrific Abuse, One Victim of Many
the final trained adapter is only a few megabytes large, despite the model being several gigabytes. a nagging part of me keeps considering the pretraining content of the model. i'm not sure what t5 models are trained on, but i imagine a generative or masked model would have more general-purpose

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Horrific Abuse, One Victim of Many
so basically codefudge is some scripts to train an adapter based on the history of a git repository. hist.bash breaks the git history into *.file files containing commit message and file content, and *.commit files that contain the file diff within the commit. at time of writing, each file changed

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-23 Thread Undiscussed Horrific Abuse, One Victim of Many
xlnet seems a more normative way to do this than longt5. notably the longt5 tokenizer doesn't include tokens for linebreaks.

https://github.com/xloem/codefudge

# CodeFudge

## Ingredients:

1. one git repository of your choosing, with the same parent folder as this one
2. bash, python, pytorch,

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
below is as far as i got. when i tried running it on xsum, i ran out of gpu ram. i imagine there are a number of ways to address that. i'm thinking the simplest approach might be to use a smaller model, even if it doesn't have the long context support. another idea is to produce a shorter set of

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
the readme says that the first column is the input, and the second column is the output. sounds reasonable. the script can also extract specific columns from csv or jsonlines input. a normal thing to do here (assuming the code works; i could test on one of their examples) might be to train a

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
it's surprising that i did this! once i got it to run, i did not review the implementation structures to see if i missed anything. i also did not implement parallelism, tests, nor documentation

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
during adapter training gpu ram usage was under 1700 MB

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
i actually did this. added adapters to longt5 in my repo branch, and successfully trained on a csv file consisting of lines equal to "a,b" . after 400 dataitems, loss dropped to around 0.6, eval loss was 0.0 i don't really know if it's _working_ or just looks like it is or something, nor how it

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
it turns out adapter-transformers actually has a mutated entire copy of the transformers repository inside it, and replaces this on the user's system. they update this regularly, but their current version does not have longt5. i added longt5 and forked the repo and pushed it. it doesn't work with

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Past Victim of Many
long-t5-tglobal-base is using 1539 MB nvidia GPU ram for me with float32 weights. I unloaded my desktop environment and it would perform simple forward inference for me.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
there appears to be useful information on expected ram requirements at https://huggingface.co/docs/transformers/v4.16.2/en/performance#model-weights

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
not enough gpu ram. cool to learn about, though

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
the example code doesn't work, but the model loads when i put it in a normal huggingface 'pipeline' class. it appears to do forward inference on my cpu; haven't figured out how to put it on the gpu

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
something i could theoretically try could be to manually process the weights to reduce their precision
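A rough way to see what reduced precision buys is to estimate the memory the weights alone occupy: parameter count times bytes per parameter. The parameter count below is an illustrative assumption (about what a 1 GB float32 checkpoint would hold), not a measured figure, and the estimate ignores activations, gradients, and optimizer state.

```python
# Storage size of one parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_bytes(n_params: int, dtype: str) -> int:
    """Estimate memory for model weights only, in bytes."""
    return n_params * BYTES_PER_PARAM[dtype]


# Assumed parameter count, for illustration only.
n_params = 250_000_000
estimates = {dtype: weight_bytes(n_params, dtype) for dtype in BYTES_PER_PARAM}
```

Under that assumption, going from float32 to float16 halves the weight footprint, though actual usage during inference or training will be higher than the weights alone.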

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
the model is 1G large. i'm suspecting this implies it won't fit in my gpu ram here

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
i had to install the transformers package from git in order to use LongT5Model. that's likely something like pip3 install --upgrade git+https://github.com/huggingface/transformers

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
it looks like this is the smallest of the most official longt5 models: https://huggingface.co/google/long-t5-tglobal-base

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
long story short, from https://arxiv.org/pdf/2112.07916.pdf , transient global attention is a new approach to attention invented for longt5, which appears to reliably outperform local attention in the same architecture. the appearance of that, combined with finetunings i saw on the huggingface hub

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
i am risking my focus in order to spend a little time learning what the difference might be between local attention and transient global attention in longt5. as a side note, it is notable that it looks a little like people are using longt5 heavily.

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
i ended up searching huggingface's github for closed issues that mentioned 'transformer-xl'. it showed me results for other long context transformers than just that one. the most recent it showed completed i think was longt5. i didn't find transformer-xl models in the huggingface hub; it was

Re: [ot][spam][crazy] adapters for semibalanced trees?

2022-07-22 Thread Undiscussed Horrific Abuse, One Victim of Many
i seem to be a little interested now in discerning if transformer xl will run on my local system. there's a doc page at https://huggingface.co/docs/transformers/model_doc/transfo-xl . it starts off with this example: from transformers import TransfoXLConfig, TransfoXLModel # Initializing a