I reprocessed a million files and wrote proper UTF-8 csv files.  This
did away with any risk of me botching something via copy/paste from
stdout.

https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz
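
On the idea quoted below of mapping bad keys to their proper counterpart,
here is a rough sketch of what that could look like when post-processing
the key counts. It is only an illustration, not how Tika does or should
do it: the CANONICAL table, normalize_key and merge_counts are
hypothetical names made up for this example, and the sample counts are
the ones quoted later in this thread.

```python
from collections import Counter

# Hypothetical lookup table: folded key -> preferred spelling.
# These targets are assumptions for illustration, not Tika's own names.
CANONICAL = {
    "content-type": "Content-Type",
    "keywords": "keywords",
    "dcterms:title": "dcterms:title",
}

# Characters seen wrapped around some keys: straight and curly double
# quotes, plus the backslash left over from escaped quotes.
_QUOTE_CHARS = '"\u201c\u201d\\'


def normalize_key(key: str) -> str:
    """Return the canonical spelling if the key is a known variant,
    otherwise return the key unchanged."""
    stripped = key.strip().strip(_QUOTE_CHARS).strip()
    # Fold case and separators only for the lookup, so that unknown
    # keys are never rewritten.
    folded = stripped.lower().replace("_", "-").replace(".", ":")
    return CANONICAL.get(folded, key)


def merge_counts(pairs):
    """Sum the counts of (key, count) pairs after normalizing each key."""
    merged = Counter()
    for key, count in pairs:
        merged[normalize_key(key)] += count
    return merged


sample = [
    ("Content-Type", 6612729),
    ("content_type", 14),
    ('\\"Content-Type\\"', 9),
    ('\\"content-type\\"', 5),
    ("DCTERMS.title", 44),
    ("dcterms.title", 26),
    ("dcterms:title", 13),
    ("dcterms.Title", 3),
    ("T", 4),
]
merged = merge_counts(sample)
```

Keys that are not in the table (such as "T") pass through untouched, so
the mapping stays conservative and only merges variants someone has
explicitly listed.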

On Mon, Oct 3, 2022 at 4:03 PM Markus Jelsma <[email protected]> wrote:
>
> Hi Tim,
>
> I would expect that many strange keys are actually present in the source 
> data, and are not due to an error somewhere in Tika or its dependencies. 
> Although mboxparser could have an issue somewhere.
>
> But it might be an idea to map some bad keys to their proper counterpart, 
> such as keywords, content-type and friends.
>
> Regards,
> Markus
>
> Op ma 3 okt. 2022 om 17:10 schreef Tim Allison <[email protected]>:
>>
>> Thank you, Markus, for looking through these sheets.  There's a chance
>> I botched the encodings in transferring data from one location to
>> another.  Let me take another look, and yes, we've got to make some
>> improvements to the mbox parser.
>>
>> More digging for me to do on the data and your findings!
>>
>> Thank you!
>>
>> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
>> <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > These aggregations of large real world sets are always interesting to look 
>> > through. Especially because they are bound to have a lot of garbage and 
>> > peculiarities. There are probably some badly chosen key names, and very 
>> > likely many programming errors.
>> >
>> > Some interesting examples:
>> >
>> > what is this:
>> > Выберите_расширение_для_паковки
>> >
>> > the usual mixing of double-quote variants; there are also many escaped
>> > quotes:
>> > ”keywords” and \"keywords\"
>> >
>> > these two are identical, but given a large enough set, they might not be:
>> > height 512205
>> > width 512205
>> >
>> > mboxparser spews out a lot of garbage, incredible:
>> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
>> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
>> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
>> >
>> > really, it does:
>> > MboxParser-_blank">http 3
>> > MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
>> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
>> >
>> > non-Latin scripts are expected, this is simplified Chinese:
>> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style 
>> > (?))
>> >
>> > perhaps shortest possible key name:
>> > T 4
>> >
>> > mboxparser, again, this time with XML tags:
>> > MboxParser-ype>state</span></font></st1:placetype></st1 4
>> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
>> >
>> > the set seems to contain stuff from adult sites:
>> > xhamster-site-verification
>> >
>> > for some reason, the Dutch government always pops up in large sets:
>> > custom:OVERHEID.Informatietype/DC.type  13
>> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13
>> >
>> > there are 18 different ways to spell/use Content-Type, of which four are, 
>> > of course, with mboxparser:
>> > Content-Type    6612729
>> > content_type    14
>> > \"Content-Type\"        9
>> > \"content-type\"        5
>> >
>> > the inevitable encoding error:
>> > pdf:docinfo:custom:-ý§ Q 10
>> > pagerankâ„¢ 50
>> >
>> > what.is.this:
>> > Laisv371DiskusijuIrK363rybosForumas 4
>> >
>> > hey, another contender for the shortest key name:
>> > M 4
>> >
>> > there are 67 unique dcterms key names, but their counts are not very high:
>> > DCTERMS.title   44
>> > dcterms.title   26
>> > dcterms:title   13
>> > dcterms.Title   3
>> >
>> > there is also a Content-Type in Russian:
>> > Тип-содержимое 3
>> >
>> > someone wants to remove your dust:
>> > Dust_Removal_Data 339
>> >
>> > there are 908 unique unknown tags, no idea what that is:
>> > Exif_IFD0:Unknown_tag_(0x8482)  36
>> > Unknown_tag_(0x00bf)    36
>> > Exif_SubIFD:Unknown_tag_(0x9009)        35
>> > Unknown_tag_(0x00a0)    35
>> > Unknown_tag_(0x050e)    35
>> >
>> > ah, the winner of the shortest key name (line 2235):
>> > 71
>> >
>> > longest key, guess who:
>> > MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps
>> >         3
>> >
>> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six 
>> > most frequently used Arabic symbols are not present. I wonder why. But 
>> > there is an RTL-script present, Hebrew. It is always strange to meet 
>> > terms/words of RTL-scripts in an otherwise general LTR-world.
>> >
>> > I was a bit disappointed not to find any obscene terms. The set seemed to 
>> > be large enough for at least some general curse words.
>> >
>> > MboxParser is the real winner with 1763 unique keys, this is really absurd!
>> >
>> > Thanks, this was fun!
>> > Markus
>> >
>> > Op ma 3 okt. 2022 om 15:26 schreef Tim Allison <[email protected]>:
>> >>
>> >> All,
>> >>
>> >>   I recently extracted metadata keys from 1 million files in our
>> >> regression corpus and did a group by.  This allows insight into common
>> >> metadata keys.
>> >>
>> >>   I've included two views, one looks at overall counts, and the other
>> >> breaks down metadata keys by mime type.
>> >>
>> >>   Please let us know if you find anything interesting or have any 
>> >> questions.
>> >>
>> >> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
>> >> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
>> >>
>> >>    Best,
>> >>
>> >>             Tim
