Ah, there are some differences this time, except for MboxParser, of course
:)

Very nice to see this happening, it wasn't present/noticed in the other set
tiff:ImageWidth,727519
tiff:ImageLength,727512

This time there are also quite a few keys containing whitespace:
Dimension HorizontalPixelSize,166272
Dimension VerticalPixelSize,166272

Attempts to do some JavaScript:
<script>,1
var gcse = document.createElement(,1
var s = document.getElementsByTagName(,1

Something that appears to be a 'tag cloud' of a Dutch blog about travelling
to Thailand:
"thailand,thailand forum,bangkok,chiang
mai,vakantie,accommodatie,hotel,surat
thani,tuktuk,eiland,krabi,phuket,sukothai,phi phi,khao
sok,guesthouse,national
park,isaan,monnik,samui,panghan,bergvolk,eiland,trein,vliegtuig,ayutthaya,visum,thai,sawasdee,tempelflower",1

More tag clouds:
"homoeopathy, homopati, homeopathy, homeopati, hormon, alopaty, allopaty,
alopati, biochemic, biokemik, biokimia",1
"homopati, homopathy, homeopati, homoeopati, biochemic, biokimia",1

Chinese, Cyrillic and Arabic mixed with Latin. Arabic in particular looks odd
when displayed correctly, with the ,1 on its left:
custom:Шифр,1
custom:тавсўф,1
custom:آموزش ایندیزاین,1
custom:关键字,1
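
To spot mixed-script keys like these automatically, something along the following lines might work. It is only a rough sketch: the function names are made up for illustration, and using the first word of the Unicode character name as a script label is a crude stand-in for a real script lookup.

```python
import unicodedata


def scripts_in_key(key):
    """Return a rough set of script labels for the letters in a metadata key."""
    scripts = set()
    for ch in key:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        # Crude heuristic: the first word of the Unicode character name,
        # e.g. "CYRILLIC CAPITAL LETTER SHA" -> "CYRILLIC".
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(key):
    """True if the letters in the key come from more than one script."""
    return len(scripts_in_key(key)) > 1


if __name__ == "__main__":
    for key in ["custom:Шифр", "custom:关键字", "Content-Type"]:
        print(key, sorted(scripts_in_key(key)))
```

Running this over the key column of the csv files would give a quick list of keys that combine a Latin prefix such as custom: with a non-Latin name.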

Escaping gone mad:
"\""content-type\""",5
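
The line above is less mad than it looks: it is ordinary CSV quoting wrapped around a key that already contained backslash-escaped quotes. Below is a small sketch of how such a line could be decoded and then mapped back to a sane key, in the spirit of the earlier idea of mapping bad keys to their proper counterparts. The helper names and the alias table are hypothetical, not anything Tika provides.

```python
import csv
import io

# Hypothetical alias table; the entries are only illustrative.
ALIASES = {"content_type": "content-type"}


def parse_line(line):
    """Parse one 'key,count' line from the csv file into (key, count)."""
    key, count = next(csv.reader(io.StringIO(line)))
    return key, int(count)


def normalize_key(key):
    """Strip stray backslashes and straight or curly double quotes, then lowercase."""
    cleaned = key.strip().strip('\\"\u201c\u201d').strip().lower()
    return ALIASES.get(cleaned, cleaned)


if __name__ == "__main__":
    key, count = parse_line(r'"\""content-type\""",5')
    print(repr(key), count, normalize_key(key))
```

After CSV decoding the key is \"content-type\" (with literal backslashes), and the normalization step reduces it to content-type. Whether lowercasing and alias mapping are appropriate for every key is an open question; this only shows the mechanics.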

There are also e-mail addresses that I am not going to put down here. And I
must say, after looking through it, MboxParser still managed to surprise me.

Thanks,
Markus

On Thu, Oct 6, 2022 at 17:27, Tim Allison <[email protected]> wrote:

> I reprocessed a million files and wrote proper UTF-8 csv files.  This
> did away with any risk of me botching something via copy/paste from
> stdout.
>
> https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz
>
> On Mon, Oct 3, 2022 at 4:03 PM Markus Jelsma <[email protected]>
> wrote:
> >
> > Hi Tim,
> >
> > I would expect that many strange keys are actually present in the source
> data, and are not due to an error somewhere in Tika or its dependencies.
> Although mboxparser could have an issue somewhere.
> >
> > But it might be an idea to map some bad keys to their proper
> counterpart, such as keywords, content-type and friends.
> >
> > Regards,
> > Markus
> >
> > On Mon, Oct 3, 2022 at 17:10, Tim Allison <[email protected]> wrote:
> >>
> >> Thank you, Markus, for looking through these sheets.  There's a chance
> >> I botched the encodings in transferring data from one location to
> >> another.  Let me take another look, and yes, we've got to make some
> >> improvements to the mbox parser.
> >>
> >> More digging for me to do on the data and your findings!
> >>
> >> Thank you!
> >>
> >> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
> >> <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > These aggregations of large real-world sets are always interesting to
> look through. Especially because they are bound to have a lot of garbage
> and peculiarities. There are probably some badly chosen key names, and very
> likely many programming errors.
> >> >
> >> > Some interesting examples:
> >> >
> >> > what is this:
> >> > Выберите_расширение_для_паковки
> >> >
> >> > the usual mixing of double-quote variants, there are also many
> escaped quotes:
> >> > ”keywords” and \"keywords\"
> >> >
> >> > these two are identical, but given a large enough set, they might not
> be:
> >> > height 512205
> >> > width 512205
> >> >
> >> > mboxparser spews out a lot of garbage, incredible:
> >> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> >> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> >> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
> >> >
> >> > really, it does:
> >> > MboxParser-_blank">http 3
> >> > MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
> >> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
> >> >
> >> > non-Latin scripts are expected, this is simplified Chinese:
> >> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular
> style (?))
> >> >
> >> > perhaps shortest possible key name:
> >> > T 4
> >> >
> >> > mboxparser, again, this time with XML tags:
> >> > MboxParser-ype>state</span></font></st1:placetype></st1 4
> >> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
> >> >
> >> > the set seems to contain stuff from adult sites:
> >> > xhamster-site-verification
> >> >
> >> > for some reason, the Dutch government always pops up in large sets:
> >> > custom:OVERHEID.Informatietype/DC.type  13
> >> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13
> >> >
> >> > there are 18 different ways to spell/use Content-Type, of which four
> are, of course, with mboxparser:
> >> > Content-Type    6612729
> >> > content_type    14
> >> > \"Content-Type\"        9
> >> > \"content-type\"        5
> >> >
> >> > the inevitable encoding error:
> >> > pdf:docinfo:custom:-ý§ Q 10
> >> > pagerankâ„¢ 50
> >> >
> >> > what.is.this:
> >> > Laisv371DiskusijuIrK363rybosForumas 4
> >> >
> >> > hey, another contender for the shortest key name:
> >> > M 4
> >> >
> >> > there are 67 unique dcterms key names, but their counts are not very
> high:
> >> > DCTERMS.title   44
> >> > dcterms.title   26
> >> > dcterms:title   13
> >> > dcterms.Title   3
> >> >
> >> > there is also a Content-Type in Russian:
> >> > Тип-содержимое 3
> >> >
> >> > someone wants to remove your dust:
> >> > Dust_Removal_Data 339
> >> >
> >> > there are 908 unique unknown tags, no idea what that is:
> >> > Exif_IFD0:Unknown_tag_(0x8482)  36
> >> > Unknown_tag_(0x00bf)    36
> >> > Exif_SubIFD:Unknown_tag_(0x9009)        35
> >> > Unknown_tag_(0x00a0)    35
> >> > Unknown_tag_(0x050e)    35
> >> >
> >> > ah, the winner of the shortest key name (line 2235):
> >> > 71
> >> >
> >> > longest key, guess who:
> >> > MboxParser-
> http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps
>       3
> >> >
> >> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But
> the six most frequently used Arabic symbols are not present. I wonder why.
> But there is an RTL-script present, Hebrew. It is always strange to meet
> terms/words of RTL scripts in an otherwise general LTR world.
> >> >
> >> > I was a bit disappointed not to find any obscene terms. The set
> seemed to be large enough for at least some general curse words.
> >> >
> >> > MboxParser is the real winner with 1763 unique keys, this is really
> absurd!
> >> >
> >> > Thanks, this was fun!
> >> > Markus
> >> >
> >> > On Mon, Oct 3, 2022 at 15:26, Tim Allison <[email protected]> wrote:
> >> >>
> >> >> All,
> >> >>
> >> >>   I recently extracted metadata keys from 1 million files in our
> >> >> regression corpus and did a group by.  This allows insight into
> common
> >> >> metadata keys.
> >> >>
> >> >>   I've included two views, one looks at overall counts, and the other
> >> >> breaks down metadata keys by mime type.
> >> >>
> >> >>   Please let us know if you find anything interesting or have any
> questions.
> >> >>
> >> >> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> >> >> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
> >> >>
> >> >>    Best,
> >> >>
> >> >>             Tim
>
