[twitter-dev] Re: Additional Delimiter

2010-03-15 Thread thruflo
Hi John,

Ah yes.  You've exposed my lack of experience with the lower level
socket library.  Sorry, I was making a fuss about nothing ;)

I moved to using a socket connection directly because I found that the
httplib-based client tweepy was using tended to hang occasionally when
doing a low latency restart.  I debugged the httplib internals and saw
it was hanging (maxing my CPU and making the process unresponsive on
an Ubuntu Linode and on OS X 10.4) at line 391 of Python 2.6's
httplib.py, in ``HTTPResponse.begin``::

    version, status, reason = self._read_status()

This was calling ``self.fp.readline()`` on the underlying socket.  I
wasn't sure if this was a Twitter issue or an httplib issue, so I
re-implemented the consumer on a raw socket so I could see exactly
what was coming down the pipe.

I've found the socket approach more reliable - I haven't been able to
reproduce the error despite a fair amount of trying (although I don't
have a scientific way of triggering it with httplib, so this is a
pretty loose test).

The code I was using for a few weeks in production, which tended to
hang and max the CPU every so often when doing a low latency restart:

http://gist.github.com/332769

The raw socket version which seems to work, pending dealing with the
chunked encoding:

http://gist.github.com/332758
http://gist.github.com/332759
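
For reference, a minimal sketch of what dealing with the chunked
encoding could look like on a raw stream.  This is not the code in the
gists above.  It assumes Python 3 and a binary file-like object such
as the one returned by ``s.makefile("rb")``, and the helper names
``read_exact`` and ``iter_chunks`` are made up for illustration.  Each
chunk is a hexadecimal size line, that many bytes of data, and a
trailing CRLF.  In the sample from the original post, ``5DE`` is 1502
in decimal, which would match the ``1496`` length line plus its CRLF
(6 bytes) followed by the 1496 bytes of status text.

```python
import io


def read_exact(fp, n):
    """Read exactly n bytes from a binary file-like object."""
    buf = b""
    while len(buf) < n:
        piece = fp.read(n - len(buf))
        if not piece:
            raise EOFError("stream closed in the middle of a chunk")
        buf += piece
    return buf


def iter_chunks(fp):
    """Yield the payload of each HTTP/1.1 chunk read from fp."""
    while True:
        size_line = fp.readline()
        if not size_line:
            return  # connection closed
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            return  # a zero-sized chunk marks the end of the body
        yield read_exact(fp, size)
        fp.readline()  # consume the CRLF that terminates the chunk data


# Usage with an in-memory stream standing in for the socket:
body = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(list(iter_chunks(io.BytesIO(body))))
```

As John points out below, chunk boundaries are not guaranteed to line
up with statuses, so the chunk payloads would need to be joined back
into one byte stream before looking for length delimiters.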

Thanks,

James.

On Mar 15, 2:58 am, John Kalucki j...@twitter.com wrote:
 You appear to be looking at the raw HTTP chunk transfer encoded stream. The
 documentation assumes that you are using an HTTP client, not the raw TCP
 stream. If you are using the raw TCP stream, you can try to play games and
 use the chunk encoding, but there are no guarantees that the chunks will
 always align with the payload.

 -John Kalucki
 http://twitter.com/jkalucki
 Infrastructure, Twitter Inc.

 On Sun, Mar 14, 2010 at 3:43 PM, thruflo thru...@googlemail.com wrote:
  I'm consuming the Streaming API using the filter method (tracking some
  user ids).  I've noticed that I'm getting an extra, undocumented, line
  before each length delimiter.

  I connect and get the following coming down the pipe:

  {{{

  HTTP/1.1 200 OK
  Content-Type: application/json
  Transfer-Encoding: chunked
  Server: Jetty(6.1.17)

  5DE
  1496
  {coordinates:null, ... snip ..., id:10487365330}

  A52
  2636
  {coordinates:null, ...snip ..., id:10487377907}

  592
  1420
  {coordinates:null, ... snip ..., id:10487298462}

  }}}

  Now, the Streaming API docs say, "Statuses are represented by a
  length, in bytes, a newline, and the status text that is exactly
  length bytes. Note that keep-alive newlines may be inserted before
  each length."

  This suggests the following read loop code (based on and equivalent to
  the way tweepy's consumer is implemented):

  {{{

  length = ''
  while True:
     c = s.recv(1)
     if c == '\n':
         break
     length += c
  length = length.strip()
  if length.isdigit():
     length = int(length)
     status_data = s.recv(length)
     # do something with the data

  }}}

  However, if you look at the third status data from above, you see that
  the extra line can sometimes be all digits, in this case ``592``.  Which
  fairly effectively borks the consumer.

  Now, I can hack that read loop in quite a few ways to accommodate this
  extra data coming down the pipe.  Question is, what's the best way to
  do so?  Is this something I can rely on, e.g.: can I look for a line
  above the length delimiter?  Will it always have three chars?  Do
  statuses always have > 1000 bytes?

  Plus I'm wondering whether this has always been the case, or if there
  are broken consumers missing tweets out there?

  Thanks,

  James.




[twitter-dev] Additional Delimiter

2010-03-14 Thread thruflo
I'm consuming the Streaming API using the filter method (tracking some
user ids).  I've noticed that I'm getting an extra, undocumented, line
before each length delimiter.

I connect and get the following coming down the pipe:

{{{

HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.17)

5DE
1496
{coordinates:null, ... snip ..., id:10487365330}

A52
2636
{coordinates:null, ...snip ..., id:10487377907}

592
1420
{coordinates:null, ... snip ..., id:10487298462}


}}}

Now, the Streaming API docs say, "Statuses are represented by a
length, in bytes, a newline, and the status text that is exactly
length bytes. Note that keep-alive newlines may be inserted before
each length."

This suggests the following read loop code (based on and equivalent to
the way tweepy's consumer is implemented):

{{{

length = ''
while True:
    c = s.recv(1)
    if c == '\n':
        break
    length += c
length = length.strip()
if length.isdigit():
    length = int(length)
    status_data = s.recv(length)
    # do something with the data

}}}

However, if you look at the third status data from above, you see that
the extra line can sometimes be all digits, in this case ``592``.  Which
fairly effectively borks the consumer.

Now, I can hack that read loop in quite a few ways to accommodate this
extra data coming down the pipe.  Question is, what's the best way to
do so?  Is this something I can rely on, e.g.: can I look for a line
above the length delimiter?  Will it always have three chars?  Do
statuses always have > 1000 bytes?

Plus I'm wondering whether this has always been the case, or if there
are broken consumers missing tweets out there?
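
For comparison, a minimal sketch of the documented length-delimited
format read from a stream that has already had the HTTP transfer
encoding removed, for example the body exposed by an HTTP client.
This assumes Python 3 and a binary file-like object, and the name
``iter_statuses`` is made up for illustration.  It skips keep-alive
newlines and keeps reading until the full status has arrived, since a
single read or ``recv`` call may return fewer bytes than requested.

```python
import io


def iter_statuses(fp):
    """Yield each length-delimited status from a binary file-like object."""
    while True:
        line = fp.readline()
        if not line:
            return  # stream closed
        line = line.strip()
        if not line:
            continue  # keep-alive newline before the next length
        if not line.isdigit():
            raise ValueError("expected a length line, got %r" % line)
        length = int(line)
        data = b""
        # Loop until exactly `length` bytes have been collected.
        while len(data) < length:
            piece = fp.read(length - len(data))
            if not piece:
                raise EOFError("stream closed in the middle of a status")
            data += piece
        yield data


# Usage with an in-memory stream standing in for the decoded HTTP body:
stream = io.BytesIO(b"\r\n5\r\nabc\r\n\r\n4\r\nxy\r\n")
print(list(iter_statuses(stream)))
```

Raising on a non-numeric line is a design choice for the sketch: it
makes stray data visible instead of silently skipping it.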

Thanks,

James.