Hi John,

Ah yes.  You've exposed my lack of experience with the lower level
socket library.  Sorry, I was making a fuss about nothing ;)

I moved to using a socket connection directly because I found that the
httplib-based client tweepy was using tended to hang occasionally when
doing a low latency restart.  I stepped through the httplib internals
and saw it hanging (maxing my CPU and making the process unresponsive,
on an Ubuntu Linode and on OS X 10.4) at line 391 of Python 2.6's
httplib.py, in ``HTTPResponse.begin``::

    version, status, reason = self._read_status()

This was calling ``self.fp.readline()`` on the underlying socket.  I
wasn't sure if this was a Twitter issue or an httplib issue, so I
reimplemented the consumer on a raw socket so I could see exactly what
was coming down the pipe.

I've found the socket approach more reliable - I haven't been able to
reproduce the error despite a fair amount of trying (although I don't
have a scientific way of triggering it with httplib, so this is a
pretty loose test).

The code I ran in production for a few weeks, which tended to hang and
max the CPU every so often when doing a low latency restart:

http://gist.github.com/332769

The raw socket version which seems to work, pending dealing with the
chunked encoding:

http://gist.github.com/332758
http://gist.github.com/332759
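
For what it's worth, dealing with the chunked encoding might look
roughly like the sketch below.  It is an illustration only, not the
code in the gists: ``iter_chunks`` and ``iter_statuses`` are names made
up here, and it assumes a file-like object (e.g. ``sock.makefile('rb')``)
positioned just after the response headers, on a stream using the
length-delimited format.  It strips the hex chunk sizes first and only
then parses lengths from the reassembled bytes, so a status that
straddles two chunks should still come out whole.  (In the trace quoted
below, 0x5DE = 1502 = 1496 + 6, which would fit a chunk holding
``1496\r\n`` plus the status.)

```python
def iter_chunks(fp):
    """Yield the payload of each HTTP/1.1 chunk read from ``fp``.

    Assumption: ``fp`` is a binary file-like object positioned just
    after the response headers (hypothetical helper, not from tweepy).
    """
    while True:
        size_line = fp.readline()
        if not size_line:
            return  # connection closed
        # The chunk size is hexadecimal; drop any ";extension" suffix.
        size_field = size_line.split(b';', 1)[0].strip()
        size = int(size_field.decode('ascii'), 16)
        if size == 0:
            return  # terminating zero-length chunk
        data = fp.read(size)
        if len(data) < size:
            return  # stream ended mid-chunk
        yield data
        fp.readline()  # consume the CRLF that follows the chunk data


def iter_statuses(chunks):
    """Yield length-delimited statuses from de-chunked byte strings.

    Assumption: every non-blank line at a status boundary is a decimal
    length, as described in the Streaming API docs.
    """
    buf = b''
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip(b'\r\n')  # skip keep-alive newlines
            head, sep, rest = buf.partition(b'\n')
            if not sep:
                break  # length line not complete yet
            length = int(head.strip().decode('ascii'))
            if len(rest) < length:
                break  # status continues in the next chunk
            yield rest[:length]
            buf = rest[length:]
```

Usage would then be something like
``for status in iter_statuses(iter_chunks(sock.makefile('rb'))): ...``
once the headers have been read off the socket.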

Thanks,

James.

On Mar 15, 2:58 am, John Kalucki <j...@twitter.com> wrote:
> You appear to be looking at the raw HTTP chunk transfer encoded stream. The
> documentation assumes that you are using a HTTP client, not the raw TCP
> stream. If you are using the raw TCP stream, you can try to play games and
> use the chunk encoding, but there are no guarantees that the chunks will
> always align with the payload.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Sun, Mar 14, 2010 at 3:43 PM, thruflo <thru...@googlemail.com> wrote:
> > I'm consuming the Streaming API using the filter method (tracking some
> > user ids).  I've noticed that I'm getting an extra, undocumented, line
> > before each length delimiter.
>
> > I connect and get the following coming down the pipe:
>
> > {{{
>
> > HTTP/1.1 200 OK
> > Content-Type: application/json
> > Transfer-Encoding: chunked
> > Server: Jetty(6.1.17)
>
> > 5DE
> > 1496
> > {"coordinates":null, ... snip ..., "id":10487365330}
>
> > A52
> > 2636
> > {"coordinates":null, ...snip ..., "id":10487377907}
>
> > 592
> > 1420
> > {"coordinates":null, ... snip ..., "id":10487298462}
>
> > }}}
>
> > Now, the Streaming API docs say, "Statuses are represented by a
> > length, in bytes, a newline, and the status text that is exactly
> > length bytes. Note that "keep-alive" newlines may be inserted before
> > each length."
>
> > This suggests the following read loop code (based on and equivalent to
> > the way tweepy's consumer is implemented):
>
> > {{{
>
> > length = ''
> > while True:
> >    c = s.recv(1)
> >    if c == '\n':
> >        break
> >    length += c
> > length = length.strip()
> > if length.isdigit():
> >    length = int(length)
> >    status_data = s.recv(length)
> >    # do something with the data
>
> > }}}
>
> > However, if you look at the third status from above, you see that
> > the extra line can sometimes consist only of digits, in that case
> > ``592``, which fairly effectively borks the consumer.
>
> > Now, I can hack that read loop in quite a few ways to accommodate this
> > extra data coming down the pipe.  Question is, what's the best way to
> > do so?  Is this something I can rely on, e.g.: I can look for a line
> > above the length delimiter?  Will it always have three chars?  Do
> > statuses always have > 1000 bytes?
>
> > Plus I'm wondering whether this has always been the case, or if there
> > are broken consumers missing tweets out there?
>
> > Thanks,
>
> > James.
>
>
