Re: [twitter-dev] Re: Snowflake: An update and some very important information

M. Edward (Ed) Borasky Tue, 19 Oct 2010 11:04:32 -0700

With all due respect, the root of the problem is that "computer scientists" think in terms of abstract machines with infinitely-wide registers, infinitely many addressable RAM cells, etc., and "business people" think in terms of human populations and their tweet rates growing geometrically for all time. Journalists believe neither of these. ;-) And neither assumption is realistic, which is why we have to make decisions like this from time to time, and why sometimes we predict disasters like Y2K or 32-bit machines crumbling in 2038 that don't actually happen. ;-)

So - for Twitter: what is your *realistic* projection for when a 53-bit integer ID will overflow? What are the underlying assumptions about human population growth, spread of Twitter, revenue models, competition, etc.? I know this is all highly confidential, so for sake of argument, assume current tweet rates per user and the goal your executives have stated of a billion users, with a plateau at that point. The question I'm asking is whether you *really* need 64-bit integer IDs for tweets or for users. ;-)
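
For a rough sense of scale, the question can be turned into back-of-envelope arithmetic. The sketch below assumes IDs are handed out sequentially from zero, which matches the old scheme but not a time-based one, where the bit budget is consumed by elapsed time rather than by tweet count. The per-user tweet rates are hypothetical placeholders, not Twitter figures; only the billion-user plateau comes from the question itself.

```python
# Back-of-envelope estimate for a sequentially assigned ID counter.
# The tweet rates used below are hypothetical, not Twitter figures.

def years_until_overflow(bits, users, tweets_per_user_per_day):
    """Years until a counter starting at zero exceeds 2**bits."""
    ids_available = 2 ** bits
    ids_per_year = users * tweets_per_user_per_day * 365.25
    return ids_available / ids_per_year

USERS = 1_000_000_000  # the billion-user plateau from the question above

for rate in (1, 10, 100):  # hypothetical tweets per user per day
    years = years_until_overflow(53, USERS, rate)
    print(f"{rate:>3} tweets/user/day: {years:,.0f} years to exhaust 53 bits")
```

Swapping 53 for 64 in the call shows how much headroom the extra bits buy under the same assumptions.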

By the way, I ask similar questions of all the "big data" geeks out there - so many naked emperors, so little time. ;-)


--
M. Edward (Ed) Borasky
http://borasky-research.net http://twitter.com/znmeb

"A mathematician is a device for turning coffee into theorems." - Paul Erdos


Quoting Craig Hockenberry <craig.hockenbe...@gmail.com>:

This approach feels wrong to me. The red flag is the duplication of
data within the payload: in 30+ years of professional development,
I've never seen that work out well.

The root of the problem is that you've chosen to deliver data in a
format (JSON) that can't support integers with a value greater than
2^53. And some of your data uses 64 bits.

The result is that you're working around the problem in a language by
using a string. Avoiding the root problem will encumber you with
legacy that you'll regret later.

Look at your proposed solution from a different point-of-view: say you
have a language that can't handle Unicode well (e.g. BASIC or Ruby.)
Would you solve this problem by adding another field called
"text_ascii"?

"text": "@themattharris hey how are things in København?".
"text_ascii": "@themattharris hey how are things in Kobenhavn?".

Seems silly, yet that is exactly what you're doing for Javascript and
long integers.

A part of this legacy in your payload is future confusion for
developers. Someone new to the Twitter API is going to be confused as
to why your ID values have both numeric and string representations.
And smart developers are going to lean towards the numeric
representation:

* 8 bytes of storage for 10765432100123456789 instead of 20 bytes.
* Faster sorting (less data to compare.)
* Correct sorting: "011" and "10" have different order depending on
whether you're sorting the string or numeric representation.

They'll eventually pay the price for choosing incorrectly.
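
The storage and ordering points in the list above can be checked with a few lines. This is only an illustration using Python's standard library; the sample ID strings other than "011" and "10" are made up for the example.

```python
import struct

ids = ["10", "9", "011"]  # "9" is an extra, made-up sample value

# Lexicographic order compares character by character.
as_strings = sorted(ids)
# Numeric order compares the values the strings represent.
as_numbers = sorted(ids, key=int)

print(as_strings)
print(as_numbers)

big_id = 10765432100123456789
packed = struct.pack("<Q", big_id)  # unsigned 64-bit, little-endian
print(len(packed), "bytes packed vs", len(str(big_id)), "bytes as text")
```

The two sort calls return the same elements in different orders, which is the trap for anyone who mixes the two representations.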

Every ID in the API is going to need documentation as a result. For
example, are place IDs affected by this change? And what about the IDs
returned by the Search API? (there's no mention of "since_id_str" and
"max_id_str" above.)

Losing consistency with the XML format is also a problem. Unless
you're planning on adding _str elements to the XML payload, you're
presenting developers with a one-way street. A consumer of JSON
"id_str" can't easily change the format of data they want to consume.

In my mind, you really only have two good choices at this point:

1) Limit Snowflake's ID space to 53 bits. Easier for developers,
harder for Twitter.

2) Make all Twitter IDs into strings. Easier for Twitter, harder for
developers.

The second choice is obviously more disruptive, but if you really need
the ID space, it's the right one. Even if it means I need to make
major changes to my code.


On Oct 18, 5:19 pm, Matt Harris <thematthar...@twitter.com> wrote:

You may remember that last week Twitter planned to enable the new Status ID
generator, 'Snowflake', but didn't. The purpose of this email is to explain
why this didn't happen, what we are doing about it, and what the new release
plan is.

So what is Snowflake?
------------------------------
Snowflake is a service we will be using to generate unique Tweet IDs. These
Tweet IDs are unique 64-bit unsigned integers, which, instead of being
sequential like the current IDs, are based on time. The full ID is composed
of a timestamp, a worker number, and a sequence number.
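
As a rough illustration of how three such fields can be packed into one integer, here is a minimal sketch. The field widths (41, 10 and 12 bits) and the function names are assumptions made for the example; this email does not state the actual layout.

```python
# Assumed field widths; the announcement does not state them.
TIMESTAMP_BITS = 41
WORKER_BITS = 10
SEQUENCE_BITS = 12

def compose_id(timestamp_ms, worker, sequence):
    """Pack the three fields into one integer, timestamp in the high bits."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= worker < (1 << WORKER_BITS):
        raise ValueError("worker number out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence number out of range")
    return ((timestamp_ms << (WORKER_BITS + SEQUENCE_BITS))
            | (worker << SEQUENCE_BITS)
            | sequence)

def decompose_id(value):
    """Split an ID back into (timestamp_ms, worker, sequence)."""
    sequence = value & ((1 << SEQUENCE_BITS) - 1)
    worker = (value >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = value >> (WORKER_BITS + SEQUENCE_BITS)
    return timestamp_ms, worker, sequence
```

Because the timestamp occupies the high bits in this sketch, IDs generated later compare greater than IDs generated earlier, whichever worker produced them.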

The problem
-----------------
Before launch it came to our attention that some programming languages such
as Javascript cannot support integers with more than 53 bits. This can be
easily examined by running a command similar to:
(90071992547409921).toString() in your browser's console or by running the
following JSON snippet through your JSON parser.

    {"id": 10765432100123456789, "id_str": "10765432100123456789"}

In affected JSON parsers the ID will not be converted successfully and will
lose accuracy. In some parsers there may even be an exception.
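
The effect can be reproduced outside a browser. Python's json module keeps integers exact by default, so the sketch below passes parse_int=float to force each integer through a double-precision float, as a stand-in for a parser that stores every number that way. The helper name id_survived is made up for the example.

```python
import json

SNIPPET = '{"id": 10765432100123456789, "id_str": "10765432100123456789"}'

# Default behaviour: integers are parsed with arbitrary precision.
exact = json.loads(SNIPPET)

# Simulate an affected parser by storing every integer as a double.
lossy = json.loads(SNIPPET, parse_int=float)

def id_survived(doc):
    """True if the numeric ID still equals its string counterpart."""
    return int(doc["id"]) == int(doc["id_str"])

print("exact parser kept the ID:", id_survived(exact))
print("double-based parser kept the ID:", id_survived(lossy))
```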

The solution
----------------
To allow javascript and JSON parsers to read the IDs we need to include a
string version of any ID when responding in the JSON format. What this means
is Status, User, Direct Message and Saved Search IDs in the Twitter API will
now be returned as an integer and a string in JSON responses. This will
apply to the main Twitter API, the Streaming API and the Search API.

For example, a status object will now contain an id and an id_str. The
following JSON representation of a status object shows the two versions of
the ID fields for each data point.

[
  {
    "coordinates": null,
    "truncated": false,
    "created_at": "Thu Oct 14 22:20:15 +0000 2010",
    "favorited": false,
    "entities": {
      "urls": [
      ],
      "hashtags": [
      ],
      "user_mentions": [
        {
          "name": "Matt Harris",
          "id": 777925,
          "id_str": "777925",
          "indices": [
            0,
            14
          ],
          "screen_name": "themattharris"
        }
      ]
    },
    "text": "@themattharris hey how are things?",
    "annotations": null,
    "contributors": [
      {
        "id": 819797,
        "id_str": "819797",
        "screen_name": "episod"
      }
    ],
    "id": 12738165059,
    "id_str": "12738165059",
    "retweet_count": 0,
    "geo": null,
    "retweeted": false,
    "in_reply_to_user_id": 777925,
    "in_reply_to_user_id_str": "777925",
    "in_reply_to_screen_name": "themattharris",
    "user": {
      "id": 6253282,
      "id_str": "6253282"
    },
    "source": "web",
    "place": null,
    "in_reply_to_status_id": 12738040524,
    "in_reply_to_status_id_str": "12738040524"
  }
]

What should you do - RIGHT NOW
----------------------------------------------
The first thing you should do is attempt to decode the JSON snippet above
using your production code parser. Observe the output to confirm the ID has
not lost accuracy.

What you do next depends on what happens:

* If your code converts the ID successfully without losing accuracy you are
OK but should consider converting to the _str versions of IDs as soon as
possible.
* If your code has lost accuracy, convert your code to using the _str
version immediately. If you do not do this your code will be unable to
interact with the Twitter API reliably.
* In some language parsers, the JSON may throw an exception when reading the
ID value. If this happens in your parser you will need to ‘pre-parse’ the
data, removing or replacing ID parameters with their _str versions.
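
One possible shape for such a pre-parse step is sketched below. It is not one of the snippets promised later in this email: the regular expression, the function name quote_ids and the choice to quote any key that is "id" or ends in "_id" are all assumptions, and a regex pass over raw JSON is a stopgap rather than a hardened solution.

```python
import json
import re

# Matches a key named "id" or ending in "_id", followed by a bare integer.
# Keys such as "id_str" do not match because of the closing quote after id.
_ID_FIELD = re.compile(r'("(?:\w+_)?id"\s*:\s*)(\d+)')

def quote_ids(raw):
    """Wrap bare integer ID values in quotes so the parser sees strings."""
    return _ID_FIELD.sub(r'\1"\2"', raw)

raw = '{"id": 10765432100123456789, "id_str": "10765432100123456789"}'

# parse_int=float simulates a parser that would otherwise lose accuracy.
doc = json.loads(quote_ids(raw), parse_int=float)
print(doc["id"])
```

After this step the "id" value is a string, so code that used it as a number needs the same changes as code moving to the _str fields.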

Summary
-------------
1) If you develop in Javascript, know that you will have to update your code
to read the string version instead of the integer version.

2) If you use a JSON decoder, validate that the example JSON, above, decodes
without throwing exceptions. If exceptions are thrown, you will need to
pre-parse the data. Please let us know the name, version, and language of
the parser which throws the exception so we can investigate.

Timeline
-----------
by 22nd October 2010 (Friday): String versions of ID numbers will start
appearing in the API responses
4th November 2010 (Thursday) : Snowflake will be turned on but at ~41bit
length
26th November 2010 (Friday) : Status IDs will break 53bits in length and
cease being usable as Integers in Javascript based languages

We understand this isn’t as seamless a transition as we had planned and
appreciate for some of you this change requires an update to your code.
We’ve tried to give as much time as possible for you to make the migration
and update your code to use the new string representations.

Our own products and tools are affected by the change and we will be making
available any pre-parsing snippets we have created to ensure code continues
to work with the new IDs.

Thanks for your support and understanding.

---
@themattharris
Developer Advocate, Twitter
http://twitter.com/themattharris


--
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list

Change your membership to this group: http://groups.google.com/group/twitter-development-talk



