Re: replication error

Adam Kocoloski Sun, 01 Feb 2009 10:02:17 -0800

On Feb 1, 2009, at 11:52 AM, Sho Fukamachi wrote:

On 02/02/2009, at 2:23 AM, Adam Kocoloski wrote:
Hi Sho, are you getting req_timedout errors in the logs? It seems a little weird to me that ibrowse starts the timer while it's still sending data to the server; perhaps there's an alternative we haven't noticed.
Yeah. Like this:
[<0.166.0>] retrying couch_rep HTTP post request due to {error,req_timedout}: "http://localhost:2808/media/_bulk_docs";
then bombs out:
[error] [emulator] Error in process <0.166.0> with exit value:{{badmatch,ok},[{couch_rep,update_docs,4},{couch_rep,save_docs_buffer,3}]}
After the 10 retries it gives an error report but I assume you know what it says .. if not I can post it.


No need, I understand what's going on.

Anyway, it finishes eventually, just needs a lot of babysitting.
There's no way to change the request timeout or bulk docs size at runtime right now, but if you don't mind digging into the source yourself you can change these as follows:
1) request timeout -- line 187 of couch_rep.erl looks like

case ibrowse:send_req(Url, Headers, Action, Body, Options) of
You can add a timeout in milliseconds as a sixth parameter to ibrowse:send_req. The default is 30000. I think the atom 'infinity' also works.
OK, I tried this. Unfortunately I have no idea what I am doing in Erlang so completely screwed it up. I made it this:
case ibrowse:send_req(Url, Headers, Action, Body, Options, 120000)

Compiles fine but now throws this error if I try to replicate:
[error] [<0.50.0>] Uncaught error in HTTP request: {error,{badmatch,undefined}}
No doubt every Erlang programmer here wants to punch me for doing something that dumb, but putting that aside for the moment .. any hints? : )

That's odd. I tried setting a 120 second timeout and didn't have any trouble. Then again, I only ran the test suite; I didn't actually force a timeout to occur or anything. Sorry, I don't have any hints at the moment.

2) bulk_docs size -- The number "100" is mentioned three times in couch_rep:get_doc_info_list/2. You can lower that to something that works better for you.
Well, a change in 1 place seems better than in 3 places .. I'll stick to the timeout for now.
My feeling is that CouchDB should probably start reducing the bulk docs size, or increasing the timeout, or both, automatically when it hits a timeout error - or making them configurable in local.ini. As discussed here before, people are using Couch to store largish attachments, and this is an intended use, so this kind of thing will definitely come up again. Or, of course, if the upcoming multipart feature will solve all of this, then .. not, heh.

Multipart won't solve the problem where ibrowse throws a timeout error even while it's still sending data. That seems like a pretty curious choice on ibrowse's part to me. Maybe when I have some more free time I can look into the timeout algo and see if it can be tweaked so that it only starts after the request has been fully transmitted. I think that would pretty much solve this problem. Barring that, I agree that some sort of back-off algorithm that lengthens the timeout after each failed request is warranted.
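
For illustration, here is a minimal sketch of what such a back-off might look like. It is not code from couch_rep or ibrowse: the module name backoff_sketch, the function request_with_backoff/3, and the choice to double the timeout on each retry are all assumptions made for the example. The caller passes in a fun that performs one request with a given timeout in milliseconds.

```erlang
-module(backoff_sketch).
-export([request_with_backoff/3]).

%% Hypothetical helper, not part of CouchDB or ibrowse.
%% SendFun is a fun(TimeoutMs) that performs one request and returns
%% either a success term or {error, req_timedout}.
%% Timeout is the timeout for the first attempt, in milliseconds.
%% Retries is how many additional attempts are allowed after a timeout.
request_with_backoff(SendFun, Timeout, Retries) ->
    case SendFun(Timeout) of
        {error, req_timedout} when Retries > 0 ->
            %% Assumed policy: double the timeout before trying again.
            request_with_backoff(SendFun, Timeout * 2, Retries - 1);
        Other ->
            %% Success, a non-timeout error, or a timeout with no
            %% retries left: hand the result back unchanged.
            Other
    end.
```

Inside couch_rep one could presumably wrap the existing call as fun(T) -> ibrowse:send_req(Url, Headers, Action, Body, Options, T) end and start from the 30000 ms default, though where exactly that would go is left open here.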

There's also one more knob we can turn. During replication we are checking the memory consumption of the process collecting docs to send to the target. If it hits 10MB we send the bulk immediately, regardless of whether it's 1 doc, 10, or 99. 10MB may be much too high given a 30 second timeout window in which we have to transmit the data; 1MB is possibly a better fit for home broadband users. If you want to fiddle with that knob instead of the ibrowse timeout you can try changing line 224 of couch_rep.erl so that instead of


couch_util:should_flush()

it would read (value is in bytes)

couch_util:should_flush(1000000)
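
To put rough numbers on why 10MB is a tight fit for a 30 second window, the following sketch computes the minimum sustained upload rate needed to push a buffer of a given size before the timeout fires. The module and function names are hypothetical, and the calculation assumes the buffer size in memory is close to what goes over the wire, ignoring HTTP and JSON encoding overhead.

```erlang
-module(flush_math).
-export([min_upload_kbps/2]).

%% Hypothetical helper for back-of-the-envelope estimates only.
%% Bytes is the buffer size in bytes, TimeoutMs the request timeout in
%% milliseconds. Returns the minimum sustained upload rate in kbit/s
%% (bits per millisecond equals kilobits per second).
min_upload_kbps(Bytes, TimeoutMs) ->
    (Bytes * 8) / TimeoutMs.
```

With these assumptions, flush_math:min_upload_kbps(10000000, 30000) comes to roughly 2667 kbit/s, while flush_math:min_upload_kbps(1000000, 30000) is roughly 267 kbit/s.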

I don't have a strong opinion at this point in time about how many of these parameters ought to be tunable in local.ini. Best,


Adam



Thanks a lot for the help ..

Sho
