I shouldn't be, but I'm disheartened by how many metadata keys are not namespaced. I don't think we can do anything about these in 2.x, but for 3.x, we should be thinking about namespacing all the keys that don't have a natural dc: or other standard namespace.
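For what it's worth, here is a rough sketch of what namespacing keys could look like. It is Python for brevity, not Tika code, and the prefix list, the `custom:` fallback, the `CANONICAL` table, and the `normalize_key` helper are all assumptions made up for illustration. It folds a few spelling variants onto one canonical key and prefixes anything that has no recognized namespace.

```python
import re

# Hypothetical list of prefixes treated as "already namespaced".
# This is an illustrative assumption, not Tika's actual list.
KNOWN_PREFIXES = ("dc:", "dcterms:", "tiff:", "pdf:", "custom:")

# Hypothetical fallback namespace for keys with no natural standard.
FALLBACK_PREFIX = "custom:"

# Hypothetical mapping from a simplified spelling to a canonical key.
CANONICAL = {
    "contenttype": "Content-Type",
    "keywords": "keywords",
}


def simplify(key: str) -> str:
    """Lowercase the key and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


def normalize_key(key: str) -> str:
    """Map known variants to a canonical key, else ensure a namespace prefix."""
    canonical = CANONICAL.get(simplify(key))
    if canonical is not None:
        return canonical
    if key.startswith(KNOWN_PREFIXES):
        return key
    return FALLBACK_PREFIX + key


if __name__ == "__main__":
    for raw in ['\\"content-type\\"', "content_type", "tiff:ImageWidth", "Dust_Removal_Data"]:
        print(raw, "->", normalize_key(raw))
```

Stripping punctuation before the lookup is what lets `content_type` and the quote-escaped variants collapse onto `Content-Type`. Whether unknown keys should be prefixed or dropped is a separate decision that this sketch does not try to settle.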
I'm also, frankly, bewildered by the amount of custom / non-standard metadata. Again, this shouldn't surprise me, but...wow.

https://issues.apache.org/jira/browse/TIKA-3872

On Fri, Oct 7, 2022 at 9:23 AM Tim Allison wrote:
>
> Is there anything that leaps out that needs attention?
>
> On Fri, Oct 7, 2022 at 7:12 AM Markus Jelsma wrote:
> >
> > Ah, there are some differences this time, except for MboxParser, of course :)
> >
> > Very nice to see this happening, it wasn't present/noticed in the other set:
> > tiff:ImageWidth,727519
> > tiff:ImageLength,727512
> >
> > There are this time also quite a few with whitespace in the keys:
> > Dimension HorizontalPixelSize,166272
> > Dimension VerticalPixelSize,166272
> >
> > Attempts to do some JavaScript:
> > <script>,1
> > var gcse = document.createElement(,1
> > var s = document.getElementsByTagName(,1
> >
> > Something that appears to be a 'tag cloud' of a Dutch blog about travelling to Thailand:
> > "thailand,thailand forum,bangkok,chiang mai,vakantie,accommodatie,hotel,surat thani,tuktuk,eiland,krabi,phuket,sukothai,phi phi,khao sok,guesthouse,national park,isaan,monnik,samui,panghan,bergvolk,eiland,trein,vliegtuig,ayutthaya,visum,thai,sawasdee,tempelflower",1
> >
> > More tag clouds:
> > "homoeopathy, homopati, homeopathy, homeopati, hormon, alopaty, allopaty, alopati, biochemic, biokemik, biokimia",1
> > "homopati, homopathy, homeopati, homoeopati, biochemic, biokimia",1
> >
> > Chinese, Cyrillic and Arabic mixed with Latin. Especially Arabic is weird when displayed correctly with the ,1 on its left:
> > custom:Шифр,1
> > custom:тавсўф,1
> > custom:آموزش ایندیزاین,1
> > custom:关键字,1
> >
> > Escaping gone mad:
> > "\""content-type\""",5
> >
> > There are also e-mail addresses that I am not going to put down here. And I must say, after looking through it, MboxParser did still surprise me.
> >
> > Thanks,
> > Markus
> >
> > On Thu, Oct 6, 2022 at 17:27, Tim Allison wrote:
> >>
> >> I reprocessed a million files and wrote proper UTF-8 csv files. This did away with any risk of me botching something via copy/paste from stdout.
> >>
> >> https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz
> >>
> >> On Mon, Oct 3, 2022 at 4:03 PM Markus Jelsma wrote:
> >> >
> >> > Hi Tim,
> >> >
> >> > I would expect that many strange keys are actually present in the source data, and are not due to an error somewhere in Tika or its dependencies. Although mboxparser could have an issue somewhere.
> >> >
> >> > But it might be an idea to map some bad keys to their proper counterpart, such as keywords, content-type and friends.
> >> >
> >> > Regards,
> >> > Markus
> >> >
> >> > On Mon, Oct 3, 2022 at 17:10, Tim Allison wrote:
> >> >>
> >> >> Thank you, Markus, for looking through these sheets. There's a chance I botched the encodings in transferring data from one location to another. Let me take another look, and yes, we've got to make some improvements to the mbox parser.
> >> >>
> >> >> More digging for me to do on the data and your findings!
> >> >>
> >> >> Thank you!
> >> >>
> >> >> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > These aggregations of large real world sets are always interesting to look through. Especially because they are bound to have a lot of garbage and peculiarities. There are probably some badly chosen key names, and very likely many programming errors.
> >> >> >
> >> >> > Some interesting examples:
> >> >> >
> >> >> > what is this:
> >> >> > Выберите_расширение_для_паковки
> >> >> >
> >> >> > the usual mixing of double-colon variants, there are also many escaped quotes:
> >> >> > ”keywords” and \"keywords\"
> >> >> >
> >> >> > these two are identical, but given a large enough set, they might not be:
> >> >> > height 512205
> >> >> > width 512205
> >> >> >
> >> >> > mboxparser spews out a lot of garbage, incredible:
> >> >> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> >> >> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> >> >> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
> >> >> >
> >> >> > really, it does:
> >> >> > MboxParser-_blank">http 3
> >> >> > MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
> >> >> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
> >> >> >
> >> >> > non-Latin scripts are expected, this is simplified Chinese:
> >> >> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))
> >> >> >
> >> >> > perhaps shortest possible key name:
> >> >> > T 4
> >> >> >
> >> >> > mboxparser, again, this time with XML tags:
> >> >> > MboxParser-ype>state</span></font></st1:placetype></st1 4
> >> >> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
> >> >> >
> >> >> > the set seems to contain stuff from adult sites:
> >> >> > xhamster-site-verification
> >> >> >
> >> >> > for some reason, the Dutch government always pops up in large sets:
> >> >> > custom:OVERHEID.Informatietype/DC.type 13
> >> >> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType 13
> >> >> >
> >> >> > there are 18 different ways to spell/use Content-Type, of which four are, of course, with mboxparser:
> >> >> > Content-Type 6612729
> >> >> > content_type 14
> >> >> > \"Content-Type\" 9
> >> >> > \"content-type\" 5
> >> >> >
> >> >> > the inevitable encoding error:
> >> >> > pdf:docinfo:custom:-ý§ Q 10
> >> >> > pagerankâ„¢ 50
> >> >> >
> >> >> > what.is.this:
> >> >> > Laisv371DiskusijuIrK363rybosForumas 4
> >> >> >
> >> >> > hey, another contender for the shortest key name:
> >> >> > M 4
> >> >> >
> >> >> > there are 67 unique dcterms key names, but their counts are not very high:
> >> >> > DCTERMS.title 44
> >> >> > dcterms.title 26
> >> >> > dcterms:title 13
> >> >> > dcterms.Title 3
> >> >> >
> >> >> > there is also a Content-Type in Russian:
> >> >> > Тип-содержимое 3
> >> >> >
> >> >> > someone wants to remove your dust:
> >> >> > Dust_Removal_Data 339
> >> >> >
> >> >> > there are 908 unique unknown tags, no idea what that is:
> >> >> > Exif_IFD0:Unknown_tag_(0x8482) 36
> >> >> > Unknown_tag_(0x00bf) 36
> >> >> > Exif_SubIFD:Unknown_tag_(0x9009) 35
> >> >> > Unknown_tag_(0x00a0) 35
> >> >> > Unknown_tag_(0x050e) 35
> >> >> >
> >> >> > ah, the winner of the shortest key name (line 2235):
> >> >> > 71
> >> >> >
> >> >> > longest key, guess who:
> >> >> > MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps 3
> >> >> >
> >> >> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six most frequently used Arabic symbols are not present. I wonder why. But there is an RTL-script present, Hebrew. It is always strange to meet terms/words of RTL-scripts in an otherwise general LTR-world.
> >> >> >
> >> >> > I was a bit disappointed not to find any obscene terms. The set seemed to be large enough for at least some general curse words.
> >> >> >
> >> >> > MboxParser is the real winner with 1763 unique keys, this is really absurd!
> >> >> >
> >> >> > Thanks, this was fun!
> >> >> > Markus
> >> >> >
> >> >> > On Mon, Oct 3, 2022 at 15:26, Tim Allison wrote:
> >> >> >>
> >> >> >> All,
> >> >> >>
> >> >> >> I recently extracted metadata keys from 1 million files in our regression corpus and did a group by. This allows insight into common metadata keys.
> >> >> >>
> >> >> >> I've included two views, one looks at overall counts, and the other breaks down metadata keys by mime type.
> >> >> >>
> >> >> >> Please let us know if you find anything interesting or have any questions.
> >> >> >>
> >> >> >> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> >> >> >> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
> >> >> >>
> >> >> >> Best,
> >> >> >>
> >> >> >> Tim
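
The "group by" described in the original message could be reproduced roughly as follows. This is only a sketch and not the script that produced the shared files: it assumes each file's metadata is already available as a plain dict, and the `count_keys` helper, the `sample` records, and the use of `Content-Type` as the mime type are hypothetical choices for illustration.

```python
from collections import Counter


def count_keys(records):
    """Count metadata keys overall and per mime type.

    `records` is an iterable of dicts, one per file (an assumption;
    the real extraction output may be shaped differently).
    """
    overall = Counter()
    by_mime = Counter()
    for metadata in records:
        # Assumption: the mime type is stored under "Content-Type".
        mime = metadata.get("Content-Type", "unknown")
        for key in metadata:
            overall[key] += 1
            by_mime[(mime, key)] += 1
    return overall, by_mime


# Made-up sample records, for illustration only.
sample = [
    {"Content-Type": "image/tiff", "tiff:ImageWidth": "640"},
    {"Content-Type": "image/tiff", "tiff:ImageWidth": "800", "tiff:ImageLength": "600"},
    {"Content-Type": "application/pdf", "dc:title": "Report"},
]

overall, by_mime = count_keys(sample)

if __name__ == "__main__":
    # Overall view: key,count in descending order of count.
    for key, n in overall.most_common():
        print(f"{key},{n}")
    # Per-mime view: mime,key,count.
    for (mime, key), n in sorted(by_mime.items()):
        print(f"{mime},{key},{n}")
```

Counting each key once per file, rather than once per value, matches the key,count layout of the listings quoted above. Writing the output through a CSV writer with explicit UTF-8 encoding would avoid the copy/paste encoding risk mentioned earlier in the thread.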
