Hi Tim,

I would expect that many of the strange keys are actually present in the
source data and are not due to an error somewhere in Tika or its
dependencies, although mboxparser could have an issue somewhere.

But it might be an idea to map some bad keys to their proper counterpart,
such as keywords, content-type and friends.
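
Something along these lines is what I have in mind. This is only a rough
sketch in Python, not Tika code: the CANONICAL_KEYS table and both function
names are made up for illustration, and the canonical spellings on the right
are assumptions about what the proper counterparts would be.

```python
import re
import unicodedata

# Hypothetical lookup table: normalized spelling -> assumed canonical key.
CANONICAL_KEYS = {
    "content-type": "Content-Type",
    "keywords": "keywords",
    "dcterms-title": "dcterms:title",
}

# Characters seen wrapped around keys in the sheets: straight and curly
# quotes, plus the backslash left over from escaped quotes.
_WRAPPERS = "\"'\u201c\u201d\u2018\u2019\\"


def normalize_key(raw: str) -> str:
    """Reduce a raw key to a comparable form (lowercase, '-' separated)."""
    key = unicodedata.normalize("NFKC", raw).strip()
    key = key.strip(_WRAPPERS).strip()
    # Treat '_', '.', ':' and whitespace as the same separator.
    key = re.sub(r"[_.:\s]+", "-", key)
    return key.lower()


def canonical_key(raw: str) -> str:
    """Return the assumed canonical key, or the raw key if it is unknown."""
    return CANONICAL_KEYS.get(normalize_key(raw), raw)
```

Keys that are not in the table are passed through untouched, so genuinely
odd keys from the source data stay visible instead of being silently
rewritten.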

Regards,
Markus

On Mon, 3 Oct 2022 at 17:10, Tim Allison <[email protected]> wrote:

> Thank you, Markus, for looking through these sheets.  There's a chance
> I botched the encodings in transferring data from one location to
> another.  Let me take another look, and yes, we've got to make some
> improvements to the mbox parser.
>
> More digging for me to do on the data and your findings!
>
> Thank you!
>
> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
> <[email protected]> wrote:
> >
> > Hi,
> >
> > These aggregations of large real-world sets are always interesting to
> > look through, especially because they are bound to have a lot of garbage
> > and peculiarities. There are probably some badly chosen key names, and
> > very likely many programming errors.
> >
> > Some interesting examples:
> >
> > what is this:
> > Выберите_расширение_для_паковки
> >
> > the usual mixing of double-quote variants, there are also many escaped
> > quotes:
> > ”keywords” and \"keywords\"
> >
> > these two are identical, but given a large enough set, they might not be:
> > height 512205
> > width 512205
> >
> > mboxparser spews out a lot of garbage, incredible:
> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
> >
> > really, it does:
> > MboxParser-_blank">http 3
> >
> > MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
> > MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3
> >
> > non-Latin scripts are expected, this is simplified Chinese:
> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))
> >
> > perhaps shortest possible key name:
> > T 4
> >
> > mboxparser, again, this time with XML tags:
> > MboxParser-ype>state</span></font></st1:placetype></st1 4
> > MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4
> >
> > the set seems to contain stuff from adult sites:
> > xhamster-site-verification
> >
> > for some reason, the Dutch government always pops up in large sets:
> > custom:OVERHEID.Informatietype/DC.type  13
> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13
> >
> > there are 18 different ways to spell/use Content-Type, of which four
> > are, of course, with mboxparser:
> > Content-Type    6612729
> > content_type    14
> > \"Content-Type\"        9
> > \"content-type\"        5
> >
> > the inevitable encoding error:
> > pdf:docinfo:custom:-ý§ Q 10
> > pagerankâ„¢ 50
> >
> > what.is.this:
> > Laisv371DiskusijuIrK363rybosForumas 4
> >
> > hey, another contender for the shortest key name:
> > M 4
> >
> > there are 67 unique dcterms key names, but their counts are not very
> > high:
> > DCTERMS.title   44
> > dcterms.title   26
> > dcterms:title   13
> > dcterms.Title   3
> >
> > there is also a Content-Type in Russian:
> > Тип-содержимое 3
> >
> > someone wants to remove your dust:
> > Dust_Removal_Data 339
> >
> > there are 908 unique unknown tags, no idea what that is:
> > Exif_IFD0:Unknown_tag_(0x8482)  36
> > Unknown_tag_(0x00bf)    36
> > Exif_SubIFD:Unknown_tag_(0x9009)        35
> > Unknown_tag_(0x00a0)    35
> > Unknown_tag_(0x050e)    35
> >
> > ah, the winner of the shortest key name (line 2235):
> > 71
> >
> > longest key, guess who:
> > MboxParser- http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps 3
> >
> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the
> > six most frequently used Arabic symbols are not present. I wonder why.
> > There is one RTL script present, though: Hebrew. It is always strange to
> > meet terms/words in RTL scripts in an otherwise general LTR world.
> >
> > I was a bit disappointed not to find any obscene terms. The set seemed
> > to be large enough for at least some general curse words.
> >
> > MboxParser is the real winner with 1763 unique keys; this is really
> > absurd!
> >
> > Thanks, this was fun!
> > Markus
> >
> > On Mon, 3 Oct 2022 at 15:26, Tim Allison <[email protected]> wrote:
> >>
> >> All,
> >>
> >>   I recently extracted metadata keys from 1 million files in our
> >> regression corpus and did a group by.  This allows insight into common
> >> metadata keys.
> >>
> >>   I've included two views: one looks at overall counts, and the other
> >> breaks down metadata keys by mime type.
> >>
> >>   Please let us know if you find anything interesting or have any
> >> questions.
> >>
> >>
> >> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> >>
> >> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
> >>
> >>    Best,
> >>
> >>             Tim
>
