I am not able to provide any real data, but attached is a script that
imports random documents of X's with approximately the distribution of
sizes I have in my real data. This is more or less the process that will
be used to import the real data, and it demonstrates the bottleneck. I
did parallelize the writes to a single database.
Josh --
#!/usr/bin/ruby
require 'rubygems'
require 'couchrest'
require 'digest/md5'
require 'time'
require 'base64'
# Tunable parameters, taken positionally from the command line.
$NUM_PROCS = (ARGV.shift || 10).to_i     # number of writer processes to fork
$NUM_DB = (ARGV.shift || 2).to_i         # parsed but unused below; targets come from $URLS
$NUM_RECS = (ARGV.shift || 100000).to_i  # total number of documents to import
$NUM_BULK = (ARGV.shift || 50).to_i      # documents per bulk_save call
$HOST = "111.111.111.111"
$DB_BASENAME = "test"
# Two CouchDB instances (ports 5985 and 5986), two databases on each.
$URLS = [
  "http://#{$HOST}:5985/#{$DB_BASENAME}_0",
  "http://#{$HOST}:5985/#{$DB_BASENAME}_1",
  "http://#{$HOST}:5986/#{$DB_BASENAME}_0",
  "http://#{$HOST}:5986/#{$DB_BASENAME}_1",
]
records_per_process = $NUM_RECS / $NUM_PROCS  # integer division; each process writes its share
# Create several attachment payloads of various sizes, wrapped in delimiters
# and base64-encoded so they can be used as inline attachments.
$DATA = [
  "X" * 1000,
  "X" * 2000,
  "X" * 4500,
  "X" * 5000,
  "X" * 7000,
  "X" * 10000
].map { |d| "||#{d}||" }.map { |data| Base64.encode64(data).gsub(/\s/, '') }
$NUM_PROCS.times do |p_num|
  fork do
    # Each child writes to one of the databases, assigned round-robin by process number.
    db_num = p_num % $URLS.size
    db = CouchRest.database!($URLS[db_num])
    docs = []
    records_per_process.times do
      doc = {
        '_attachments' => {
          "text.txt" => {
            'content_type' => 'text/plain',
            'data' => $DATA[rand($DATA.size)],
          },
        },
        #:sum => Digest::MD5.hexdigest(data)
      }
      docs << doc
      if docs.size >= $NUM_BULK
        db.bulk_save docs
        docs = []
      end
    end
    # Flush any leftover documents that didn't fill a complete bulk batch.
    db.bulk_save docs unless docs.empty?
  end
end
Process.waitall
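
For anyone who wants to try it, a typical invocation (the filename here is
made up; the arguments are the positional parameters at the top of the
script) would look something like:

  ruby import_test.rb 10 2 100000 50

i.e. 10 writer processes, 100000 documents in total, saved in bulk batches
of 50. All four arguments have defaults, and the second ($NUM_DB) is parsed
but the target databases actually come from the hard-coded $URLS list.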
Damien Katz wrote:
> Did you parallelize writes to a single database? Attachments are
> written in parallel, which should help you in this instance.
>
> -Damien
>
>
> On Jan 9, 2009, at 7:20 PM, Josh Bryan wrote:
>
>> Yes.
>>
>> Damien Katz wrote:
>>> Are you using bulk updates?
>>>
>>> -Damien
>>>
>>> On Jan 9, 2009, at 7:12 PM, Josh Bryan wrote:
>>>
>>>>
>>>> On a dual-core Pentium 3.0 GHz with Erlang 5.6 and CouchDB 0.8.0,
>>>> *using bulk* writes, I get a throughput of 95 writes/second.
>>
>