I reprocessed a million files and wrote proper UTF-8 CSV files. This removes any risk of my botching something via copy/paste from stdout.
https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz

On Mon, Oct 3, 2022 at 4:03 PM Markus Jelsma <[email protected]> wrote:
>
> Hi Tim,
>
> I would expect that many strange keys are actually present in the source
> data, and are not due to an error somewhere in Tika or its dependencies.
> Although mboxparser could have an issue somewhere.
>
> But it might be an idea to map some bad keys to their proper counterpart,
> such as keywords, content-type and friends.
>
> Regards,
> Markus
>
> On Mon, Oct 3, 2022 at 17:10, Tim Allison <[email protected]> wrote:
>>
>> Thank you, Markus, for looking through these sheets. There's a chance
>> I botched the encodings in transferring data from one location to
>> another. Let me take another look, and yes, we've got to make some
>> improvements to the mbox parser.
>>
>> More digging for me to do on the data and your findings!
>>
>> Thank you!
>>
>> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
>> <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > These aggregations of large real world sets are always interesting to look
>> > through. Especially because they are bound to have a lot of garbage and
>> > peculiarities. There are probably some badly chosen key names, and very
>> > likely many programming errors.
>> >
>> > Some interesting examples:
>> >
>> > what is this:
>> > Выберите_расширение_для_паковки
>> >
>> > the usual mixing of double-colon variants, there are also many escaped
>> > quotes:
>> > ”keywords” and \"keywords\"
>> >
>> > these two are identical, but given a large enough set, they might not be:
>> > height 512205
>> > width 512205
>> >
>> > mboxparser spews out a lot of garbage, incredible:
>> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
>> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
>> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
>> >
>> > really, it does:
>> > MboxParser-_blank">http 3
>> > MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
>> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
>> >
>> > non-Latin scripts are expected, this is simplified Chinese:
>> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style
>> > (?))
>> >
>> > perhaps shortest possible key name:
>> > T 4
>> >
>> > mboxparser, again, this time with XML tags:
>> > MboxParser-ype>state</span></font></st1:placetype></st1 4
>> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
>> >
>> > the set seems to contain stuff from adult sites:
>> > xhamster-site-verification
>> >
>> > for some reason, the Dutch government always pops up in large sets:
>> > custom:OVERHEID.Informatietype/DC.type 13
>> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType 13
>> >
>> > there are 18 different ways to spell/use Content-Type, of which four are,
>> > of course, with mboxparser:
>> > Content-Type 6612729
>> > content_type 14
>> > \"Content-Type\" 9
>> > \"content-type\" 5
>> >
>> > the inevitable encoding error:
>> > pdf:docinfo:custom:-ý§ Q 10
>> > pagerankâ„¢ 50
>> >
>> > what.is.this:
>> > Laisv371DiskusijuIrK363rybosForumas 4
>> >
>> > hey, another contender for the shortest key name:
>> > M 4
>> >
>> > there are 67 unique dcterms key names, but their counts are not very high:
>> > DCTERMS.title 44
>> > dcterms.title 26
>> > dcterms:title 13
>> > dcterms.Title 3
>> >
>> > there is also a Content-Type in Russian:
>> > Тип-содержимое 3
>> >
>> > someone wants to remove your dust:
>> > Dust_Removal_Data 339
>> >
>> > there are 908 unique unknown tags, no idea what that is:
>> > Exif_IFD0:Unknown_tag_(0x8482) 36
>> > Unknown_tag_(0x00bf) 36
>> > Exif_SubIFD:Unknown_tag_(0x9009) 35
>> > Unknown_tag_(0x00a0) 35
>> > Unknown_tag_(0x050e) 35
>> >
>> > ah, the winner of the shortest key name (line 2235):
>> > 71
>> >
>> > longest key, guess who:
>> > MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps
>> > 3
>> >
>> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six
>> > most frequently used Arabic symbols are not present. I wonder why. But
>> > there is an RTL-script present, Hebrew. It is always strange to meet
>> > terms/words of RTL-scripts in an otherwise general LTR-world.
>> >
>> > I was a bit disappointed not to find any obscene terms. The set seemed to
>> > be large enough for at least some general curse words.
>> >
>> > MboxParser is the real winner with 1763 unique keys, this is really absurd!
>> >
>> > Thanks, this was fun!
>> > Markus
>> >
>> > On Mon, Oct 3, 2022 at 15:26, Tim Allison <[email protected]> wrote:
>> >>
>> >> All,
>> >>
>> >> I recently extracted metadata keys from 1 million files in our
>> >> regression corpus and did a group by. This allows insight into common
>> >> metadata keys.
>> >>
>> >> I've included two views, one looks at overall counts, and the other
>> >> breaks down metadata keys by mime type.
>> >>
>> >> Please let us know if you find anything interesting or have any
>> >> questions.
>> >>
>> >> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
>> >> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
>> >>
>> >> Best,
>> >>
>> >> Tim
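
For illustration only: the idea raised in the thread of mapping bad keys to their proper counterparts (keywords, content-type and friends) might look roughly like the sketch below. It is not Tika code. The `CANONICAL` table and the `normalize_key` and `merge_counts` helpers are hypothetical names made up for this example, and the cleanup rules (drop escaped and curly quotes, treat `_` as `-`, ignore case) are assumptions about what "bad" means. The counts are the four Content-Type variants quoted above.

```python
from collections import Counter

# Hypothetical canonical spellings; the thread only mentions
# "keywords, content-type and friends", so this table is an assumption.
CANONICAL = {
    "content-type": "Content-Type",
    "keywords": "keywords",
}


def normalize_key(key: str) -> str:
    """Map a noisy metadata key to a canonical spelling when one is known."""
    # Remove escaped quotes (\") first, then any straight or curly quotes
    # left at either end of the key.
    cleaned = key.strip().replace('\\"', "").strip('"“”')
    # Compare case-insensitively, treating "_" and "-" as the same separator.
    lookup = cleaned.lower().replace("_", "-")
    # Unknown keys are returned cleaned but otherwise unchanged.
    return CANONICAL.get(lookup, cleaned)


def merge_counts(counts: dict) -> Counter:
    """Re-aggregate per-key counts after normalization."""
    merged = Counter()
    for key, n in counts.items():
        merged[normalize_key(key)] += n
    return merged


# The four Content-Type variants and counts listed in the thread.
raw = {
    "Content-Type": 6612729,
    "content_type": 14,
    '\\"Content-Type\\"': 9,
    '\\"content-type\\"': 5,
}
merged = merge_counts(raw)
print(merged)
```

Whether such a mapping belongs in the parsers or in a post-processing step is left open here. The sketch only shows that the variants collapse to a single key once quoting and separators are normalized.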
