Hi John - thanks for the update.

    Andy

On 16/07/2021 09:31, John Walker wrote:
Hi

After introducing the 120s timeout at the client, we are no longer seeing the 
'hanging' processes, so I don't think there is any bug in Jena.

Since the patch was introduced around 10 days ago, we have seen a few requests 
where the timeout is reached and the client closes the connection.
We'd expect those requests to typically complete in under 120s, so we need to 
check with James why those specific requests are taking longer.
We'll do that offline as it is not really a concern for Jena.
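For reference, a client-side timeout of that kind can be sketched with the JDK's built-in HTTP client (Java 11+). This is only a hedged illustration - the actual client code isn't shown in this thread (it presumably uses Apache HttpClient, as in the snippet later on), and the URL and Accept header here are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ClientTimeout {
    // Sketch only: the real client may configure Apache HttpClient instead.
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(120)) // limit the connect phase
                .build();
    }

    static HttpRequest buildRequest(String url) {  // url is a placeholder
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(120))  // abort the exchange after 120s
                .header("Accept", "application/n-triples")
                .GET()
                .build();
    }
}
```

When the 120s budget is exceeded, `send` throws `HttpTimeoutException` and the connection is closed, which matches the behaviour described above (the client closes the connection).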

John

-----Original Message-----
From: James Anderson <anderson.james.1...@gmail.com>
Sent: Saturday, July 10, 2021 10:51 PM
To: users@jena.apache.org
Subject: Re: Jena hangs while reading HTTP stream


On 2021-07-10, at 22:02:43, Andy Seaborne <a...@apache.org> wrote:

Hi John,



On 10/07/2021 17:03, John Walker wrote:
We're using a 120s timeout for all the requests, which should give plenty
of time for the query requests to complete in regular circumstances.

That's 120s at the server?

i understood that to describe a timeout which they have introduced into
some stage in the client process(es).

the logs in the server indicate that the two requests for which they have
provided identifiers completed.

were the proxy to time a request out, the client would receive a 504. that
does not appear in its log.
were the sparql processor to time them out, the response would be a
generic 500. that also does not appear.

What happens if that goes off? The response is closed? (a Q for james)

were the request to time out, the upstream connection would be closed.
the proxy should close its client connection as a consequence.

As we use N-Triples, I was wondering if the N-Triples parser uses the
readLine method to read from the stream.
If there were some line that is not terminated with an EOL character,
might that cause this issue?

 From the information so far, not likely. The parser has not changed and this
parsing path is well trodden.

What matters is that it's three terms then a DOT until end-of-stream is seen.
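As a toy illustration of that "three terms then a dot" shape (not Jena's actual tokenizer, which reads tokens from a character stream rather than lines), a line-level check might look like:

```java
import java.util.regex.Pattern;

public class NtLine {
    // Toy sketch only: a real N-Triples parser tokenizes terms properly;
    // this just shows the subject-predicate-object-dot shape of a line.
    static final Pattern TRIPLE = Pattern.compile(
            "\\s*(\\S+)\\s+(\\S+)\\s+(.+?)\\s*\\.\\s*");

    static boolean looksLikeTriple(String line) {
        return TRIPLE.matcher(line).matches();
    }
}
```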

End-of-stream happens when the chunking transfer layer says so - that's in
Apache HttpClient.

one can expect the response to have been chunked.
the proxy is not configured to cache responses.


If you see a single CPU thread at 100%, the parser is looping but there isn't
a loop except delivering triples to the graph.  And the NT parse is well-used
and quite simple.

so it seems to me it is one of two cases:

1 - bytes are flowing but the parser can't send output triples into the
destination - java heap pressure (you'll see multiple CPU threads at 100%)

How long do you leave it? Eventually - many minutes (20 is possible) - this
case will out-of-memory.

55 Mbytes isn't a very large number of triples. 500K maybe (without
knowing the data, a rule of thumb is 100 bytes per N-Triples triple) and it's a
freshly created graph.  Has heap been used up by the rest of the application?
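The rule-of-thumb arithmetic here can be made explicit (the 100 bytes/triple figure is the rough estimate from this thread, not a measured value):

```java
public class TripleEstimate {
    // Rough estimate only: ~100 bytes per N-Triples triple, per the thread.
    static long estimateTriples(long responseBytes, long bytesPerTriple) {
        return responseBytes / bytesPerTriple;
    }

    public static void main(String[] args) {
        // 55 MB response / 100 bytes per triple ~= 550K triples,
        // the same order of magnitude as the ~399K seen in the server logs.
        System.out.println(estimateTriples(55_000_000L, 100L));
    }
}
```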

the log entries for the two researched requests indicated 398940 and 398914
statements in the respective responses.


2 - Bytes are not flowing into the application, the parser is waiting. CPU
usage 0%.

The next question is whether the same operation will fault again or if the
same requestURL sometimes works.

The Jena code for all this is deterministic. There's no hidden parallelism in
this case.

----

HTTP is layered:

Transfer-Encoding [lowest level]
Content-Encoding
Actual stuff (Content-type).

Transfer is point-to-point, Content-Encoding is end-to-end.
"Transfer-Encoding: chunked" is used for a stream of response bytes
without Content-Length.

What intermediaries are there between the app and Dydra? There is an
nginx but is that acting as a reverse proxy (and what connection method
does it use) or is Dydra providing an nginx module?

Presumably there is a load balancer.

Does the app talk to a gateway?

Do any intermediaries cache?

Each hop between systems is a point-to-point "transfer".
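To make the layering concrete, an annotated response might look like this. The headers are invented for illustration, and the parenthetical notes are not part of an actual HTTP message:

```
HTTP/1.1 200 OK
Content-Type: application/n-triples     (the actual stuff)
Content-Encoding: gzip                  (end-to-end; optional)
Transfer-Encoding: chunked              (point-to-point; may be re-framed per hop)

3e8
... 0x3e8 (1000) bytes of body ...
0

```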

Model sub = ModelFactory.createDefaultModel();
try (TypedInputStream stream = HttpOp.execHttpGet(requestURL,
        WebContent.contentTypeNTriples, createHttpClient(auth), null)) {
    // The following part sometimes hangs:

Are you positive it returns from HttpOp.execHttpGet and enters the parser?

    RDFParser.create()
        .source(stream)
        .lang(Lang.NTRIPLES)
        .errorHandler(ErrorHandlerFactory.errorHandlerStrict)
        .parse(sub.getGraph());
} catch (Exception ex) {
    // Handle the exception
}


The parser does not use readLine.  NT, TTL, NQ, TriG share a tokenizer and
tokens get read from the character stream.  Whitespace is discarded. The NT
parsing is slightly permissive here (but you can't have """-strings).

End of (chunked) stream will happen and that is end-of-triples. The TCP
connection is still open but the response stream has ended. TCP is handled
by Apache HttpClient - that's quite unlikely to be broken.


Otherwise if the output stream from Dydra is not closed, or the socket is
not closed?

As james says - chunking happens (per transfer, presumably), and when the server
says "end of response", a chunk of zero bytes is sent, which marks the end of
the content.  If each transfer is chunked, then each hop closes this response but
the TCP connection is kept open.
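A minimal sketch of that framing, hand-rolled purely for illustration (in practice Apache HttpClient does this, including chunk extensions and trailers, which this sketch ignores): each chunk is a hex size line, CRLF, that many bytes, CRLF, and a zero-size chunk terminates the content.

```java
public class ChunkedDecode {
    // Minimal illustration of chunked-body framing; not a real decoder.
    static String decode(String body) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int eol = body.indexOf("\r\n", pos);
            int size = Integer.parseInt(body.substring(pos, eol).trim(), 16);
            if (size == 0) {
                break;                      // zero-size chunk: end of content
            }
            int start = eol + 2;
            out.append(body, start, start + size);
            pos = start + size + 2;         // skip the chunk's trailing CRLF
        }
        return out.toString();
    }
}
```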

HTTP is not a simple protocol!

The version of Apache HttpClient (v4) does not support HTTP/2, so protocol
upgrade to HTTP/2 is not happening. We are on HTTP/1.1.

    Andy
