Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-13 Thread Tanya Unterberger
Thanks, Sean.

It is good to know what the limitations are. And good that I made a mistake
at the start and we kind of have a workaround...

On 13 October 2016 at 16:24, Sean Beckett  wrote:

> Tanya, when you write the data in ms but don't specify the precision, the
> database interprets those millisecond timestamps as nanoseconds, and all
> the data is written to a single shard covering Jan 1, 1970.
>
>
> > insert msns value=42 1476336190000
> > select * from msns
> name: msns
> --
> time value
> 1476336190000 42
>
> > precision rfc3339
> > select * from msns
> name: msns
> --
> time value
> 1970-01-01T00:24:36.33619Z 42
>
> That's why everything is fast, because all the data is in one shard.
>
> On Wed, Oct 12, 2016 at 9:50 PM, Tanya Unterberger <
> tanya.unterber...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> I can reproduce all the CPU issues, slowness, etc. if I try to import the
>> data that I have in milliseconds, specifying precision as milliseconds.
>>
>> If I insert the same data without specifying any precision and query
>> without specifying any precision, the database is lightning fast. The same
>> data.
>>
>> The reason I was adding precision=ms is that I thought it was the right
>> thing to do. The manual advises that Influx stores the data in nanoseconds
>> but to use the lowest precision to insert. So at some stage I even used
>> hours and inserted the data with precision=h. When Influx tried to convert
>> that data to nanoseconds, index it, etc., it was having a hissy fit.
>>
>> Is it a bug, or should the manual state that if you query the data at the
>> same precision as you insert it, you can go with the lowest precision and
>> not specify what precision you are inserting?
>>
>> Thanks,
>> Tanya
>>
>> On 13 October 2016 at 10:26, Tanya Unterberger <
>> tanya.unterber...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> The data is from 1838 to 2016, daily (sparse at times). We need to
>>> retain it, therefore the default policy.
>>>
>>> Thanks,
>>> Tanya
>>>
>>> On 13 October 2016 at 06:26, Sean Beckett  wrote:
>>>
 Tanya, what range of time does your data cover? What are the retention
 policies on the database?

 On Tue, Oct 11, 2016 at 11:14 PM, Tanya Unterberger <
 tanya.unterber...@gmail.com> wrote:

> Hi Sean,
>
> 1. Initially I killed the process
> 2. At some point I restarted influxdb service
> 3. Error logs show no errors
> 4. I rebuilt the server, installed the latest rpm. Reimported the data
> via scripts. Data goes in, but the server is unusable. Looks like indexing
> might be stuffed. The size of the data in that database is 38M. Total size
> of /var/lib/influxdb/data/ 273M
> 5. CPU went berserk and doesn't come down
> 6. A query like select count(blah) against the measurement that was batch
> inserted (10k records at a time) is unusable and times out
> 7. I need to import around 15 million records. How should I throttle
> that?
>
> At the moment I am pulling my hair out (not a pretty sight)
>
> Thanks a lot!
> Tanya
>
> On 12 October 2016 at 06:11, Sean Beckett  wrote:
>
>>
>>
>> On Tue, Oct 11, 2016 at 12:11 AM, 
>> wrote:
>>
>>> Hi,
>>>
>>> It seems that the old issue might have surfaced again (#3349) in
>>> v1.0.
>>>
>>> I tried to insert a large number of records (3913595) via a script,
>>> inserting 10,000 rows at a time.
>>>
>>> After a while I received
>>>
>>> HTTP/1.1 500 Internal Server Error
>>> Content-Type: application/json
>>> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
>>> X-Influxdb-Version: 1.0.0
>>> Date: Tue, 11 Oct 2016 05:12:02 GMT
>>> Content-Length: 20
>>>
>>> {"error":"timeout"}
>>> HTTP/1.1 100 Continue
>>>
>>> I killed the process, after which the whole box became pretty much
>>> unresponsive.
>>>
>>
>> Killed the InfluxDB process, or the batch writing script process?
>>
>>
>>>
>>> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives
>>> me nothing) although the setting for http logging is true:
>>>
>>
>> systemd OSes put the logs in a new place (yay!?). See
>> http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd
>> for how to read the logs.
>>
>>
>>>
>>> [http]
>>>   enabled = true
>>>   bind-address = ":8086"
>>>   auth-enabled = true
>>>   log-enabled = true
>>>
>>> I tried to restart influx, but got the following error:
>>>
>>> Failed to connect to http://localhost:8086
>>> Please check your connection settings and ensure 'influxd' is
>>> running.
>>>
>>
>> The `influx` console is just a fancy wrapper on the API. That error
>> doesn't mean much except that the HTTP listener in 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-12 Thread Sean Beckett
Tanya, when you write the data in ms but don't specify the precision, the
database interprets those millisecond timestamps as nanoseconds, and all
the data is written to a single shard covering Jan 1, 1970.


> insert msns value=42 1476336190000
> select * from msns
name: msns
--
time value
1476336190000 42

> precision rfc3339
> select * from msns
name: msns
--
time value
1970-01-01T00:24:36.33619Z 42

That's why everything is fast, because all the data is in one shard.
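
(1476336190000 interpreted as nanoseconds is only about 1476 seconds after
the epoch, which is why it lands on Jan 1, 1970.)

For comparison, here is a minimal sketch of the same write with the
millisecond precision declared explicitly via the HTTP API. The database
name "mydb" and the credentials are placeholders, not your actual setup:

curl -i -XPOST 'http://localhost:8086/write?db=mydb&precision=ms' \
  -u user:password \
  --data-binary 'msns value=42 1476336190000'

With precision=ms that point is stored at its intended 2016 timestamp
rather than in the 1970 shard.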

On Wed, Oct 12, 2016 at 9:50 PM, Tanya Unterberger <
tanya.unterber...@gmail.com> wrote:

> Hi Sean,
>
> I can reproduce all the CPU issues, slowness, etc. if I try to import the
> data that I have in milliseconds, specifying precision as milliseconds.
>
> If I insert the same data without specifying any precision and query
> without specifying any precision, the database is lightning fast. The same
> data.
>
> The reason I was adding precision=ms is that I thought it was the right
> thing to do. The manual advises that Influx stores the data in nanoseconds
> but to use the lowest precision to insert. So at some stage I even used
> hours and inserted the data with precision=h. When Influx tried to convert
> that data to nanoseconds, index it, etc., it was having a hissy fit.
>
> Is it a bug, or should the manual state that if you query the data at the
> same precision as you insert it, you can go with the lowest precision and
> not specify what precision you are inserting?
>
> Thanks,
> Tanya
>
> On 13 October 2016 at 10:26, Tanya Unterberger <
> tanya.unterber...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> The data is from 1838 to 2016, daily (sparse at times). We need to retain
>> it, therefore the default policy.
>>
>> Thanks,
>> Tanya
>>
>> On 13 October 2016 at 06:26, Sean Beckett  wrote:
>>
>>> Tanya, what range of time does your data cover? What are the retention
>>> policies on the database?
>>>
>>> On Tue, Oct 11, 2016 at 11:14 PM, Tanya Unterberger <
>>> tanya.unterber...@gmail.com> wrote:
>>>
 Hi Sean,

 1. Initially I killed the process
 2. At some point I restarted influxdb service
 3. Error logs show no errors
 4. I rebuilt the server, installed the latest rpm. Reimported the data
 via scripts. Data goes in, but the server is unusable. Looks like indexing
 might be stuffed. The size of the data in that database is 38M. Total size
 of /var/lib/influxdb/data/ 273M
 5. CPU went berserk and doesn't come down
 6. A query like select count(blah) against the measurement that was batch
 inserted (10k records at a time) is unusable and times out
 7. I need to import around 15 million records. How should I throttle
 that?

 At the moment I am pulling my hair out (not a pretty sight)

 Thanks a lot!
 Tanya

 On 12 October 2016 at 06:11, Sean Beckett  wrote:

>
>
> On Tue, Oct 11, 2016 at 12:11 AM,  wrote:
>
>> Hi,
>>
>> It seems that the old issue might have surfaced again (#3349) in v1.0.
>>
>> I tried to insert a large number of records (3913595) via a script,
>> inserting 10,000 rows at a time.
>>
>> After a while I received
>>
>> HTTP/1.1 500 Internal Server Error
>> Content-Type: application/json
>> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
>> X-Influxdb-Version: 1.0.0
>> Date: Tue, 11 Oct 2016 05:12:02 GMT
>> Content-Length: 20
>>
>> {"error":"timeout"}
>> HTTP/1.1 100 Continue
>>
>> I killed the process, after which the whole box became pretty much
>> unresponsive.
>>
>
> Killed the InfluxDB process, or the batch writing script process?
>
>
>>
>> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives
>> me nothing) although the setting for http logging is true:
>>
>
> systemd OSes put the logs in a new place (yay!?). See
> http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd
> for how to read the logs.
>
>
>>
>> [http]
>>   enabled = true
>>   bind-address = ":8086"
>>   auth-enabled = true
>>   log-enabled = true
>>
>> I tried to restart influx, but got the following error:
>>
>> Failed to connect to http://localhost:8086
>> Please check your connection settings and ensure 'influxd' is running.
>>
>
> The `influx` console is just a fancy wrapper on the API. That error
> doesn't mean much except that the HTTP listener in InfluxDB is not yet up
> and running.
>
>
>>
>> Although I can see that influxd is up and running:
>>
>> > systemctl | grep influx
>> influxdb.service
>> loaded active running   InfluxDB is an open-source,
>> distributed, time series database
>>
>> What do I do now?
>>
>
> Check the logs as referenced above.
>
> 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-12 Thread Sean Beckett
That's the entire source of the issue. The system is creating 1 week shards
from 1838 to now. That's a bit over 9000 shard groups, each of which only
has a few hundred points. The shard files are incredibly sparse, and the
overhead for each one is fixed.

Use shard durations of 10 years or more. That way each shard will have >
200k points and there will only be 18 or fewer shards.

Eliminate the duplicate points if you can, but with longer shard durations
the system should be much more performant, and the overwrites may not be an
issue.
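
As a rough sketch of what that looks like in InfluxQL (the database and
policy names here are placeholders):

CREATE RETENTION POLICY "forever" ON "mydb" DURATION INF REPLICATION 1 SHARD DURATION 3650d DEFAULT

or, for an existing policy:

ALTER RETENTION POLICY "autogen" ON "mydb" SHARD DURATION 3650d

Note that a changed shard duration only applies to shard groups created
after the change, so the existing data would need to be dropped and
re-imported to pick it up.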

On Wed, Oct 12, 2016 at 5:26 PM, Tanya Unterberger <
tanya.unterber...@gmail.com> wrote:

> Hi Sean,
>
> The data is from 1838 to 2016, daily (sparse at times). We need to retain
> it, therefore the default policy.
>
> Thanks,
> Tanya
>
> On 13 October 2016 at 06:26, Sean Beckett  wrote:
>
>> Tanya, what range of time does your data cover? What are the retention
>> policies on the database?
>>
>> On Tue, Oct 11, 2016 at 11:14 PM, Tanya Unterberger <
>> tanya.unterber...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> 1. Initially I killed the process
>>> 2. At some point I restarted influxdb service
>>> 3. Error logs show no errors
>>> 4. I rebuilt the server, installed the latest rpm. Reimported the data
>>> via scripts. Data goes in, but the server is unusable. Looks like indexing
>>> might be stuffed. The size of the data in that database is 38M. Total size
>>> of /var/lib/influxdb/data/ 273M
>>> 5. CPU went berserk and doesn't come down
>>> 6. A query like select count(blah) against the measurement that was batch
>>> inserted (10k records at a time) is unusable and times out
>>> 7. I need to import around 15 million records. How should I throttle
>>> that?
>>>
>>> At the moment I am pulling my hair out (not a pretty sight)
>>>
>>> Thanks a lot!
>>> Tanya
>>>
>>> On 12 October 2016 at 06:11, Sean Beckett  wrote:
>>>


 On Tue, Oct 11, 2016 at 12:11 AM,  wrote:

> Hi,
>
> It seems that the old issue might have surfaced again (#3349) in v1.0.
>
> I tried to insert a large number of records (3913595) via a script,
> inserting 10,000 rows at a time.
>
> After a while I received
>
> HTTP/1.1 500 Internal Server Error
> Content-Type: application/json
> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
> X-Influxdb-Version: 1.0.0
> Date: Tue, 11 Oct 2016 05:12:02 GMT
> Content-Length: 20
>
> {"error":"timeout"}
> HTTP/1.1 100 Continue
>
> I killed the process, after which the whole box became pretty much
> unresponsive.
>

 Killed the InfluxDB process, or the batch writing script process?


>
> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
> nothing) although the setting for http logging is true:
>

 systemd OSes put the logs in a new place (yay!?). See
 http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd
 for how to read the logs.


>
> [http]
>   enabled = true
>   bind-address = ":8086"
>   auth-enabled = true
>   log-enabled = true
>
> I tried to restart influx, but got the following error:
>
> Failed to connect to http://localhost:8086
> Please check your connection settings and ensure 'influxd' is running.
>

 The `influx` console is just a fancy wrapper on the API. That error
 doesn't mean much except that the HTTP listener in InfluxDB is not yet up
 and running.


>
> Although I can see that influxd is up and running:
>
> > systemctl | grep influx
> influxdb.service
> loaded active running   InfluxDB is an open-source,
> distributed, time series database
>
> What do I do now?
>

 Check the logs as referenced above.

 The non-responsiveness on startup isn't surprising. It sounds like the
 system was overwhelmed with writes, which means that the WAL would have
 many points cached, waiting to be flushed to disk. On restart, InfluxDB
 won't accept new writes or queries until the cached ones in the WAL have
 persisted. For this reason, the HTTP listener is off until the WAL is
 flushed.


>
> I tried the same import over the weekend, then the script timeout
> happened eventually but the result was the same unresponsive, unusable
> server. We rebuilt the box and started again.
>

 It sounds like the box is just overwhelmed. Did you get backoff
 messages from the writes before the crash? What are the machine specs?



>
> Perhaps it is worthwhile mentioning that the same measurement already
> contained about 9 million records. Some of these records had the same
> timestamp as the ones I tried to import, i.e. they should have been 
> merged.
>

 Overwriting points is much much 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-12 Thread Tanya Unterberger
Hi Sean,

I can reproduce all the CPU issues, slowness, etc. if I try to import the
data that I have in milliseconds, specifying precision as milliseconds.

If I insert the same data without specifying any precision and query
without specifying any precision, the database is lightning fast. The same
data.

The reason I was adding precision=ms is that I thought it was the right
thing to do. The manual advises that Influx stores the data in nanoseconds
but to use the lowest precision to insert. So at some stage I even used
hours and inserted the data with precision=h. When Influx tried to convert
that data to nanoseconds, index it, etc., it was having a hissy fit.

Is it a bug, or should the manual state that if you query the data at the
same precision as you insert it, you can go with the lowest precision and
not specify what precision you are inserting?

Thanks,
Tanya

On 13 October 2016 at 10:26, Tanya Unterberger 
wrote:

> Hi Sean,
>
> The data is from 1838 to 2016, daily (sparse at times). We need to retain
> it, therefore the default policy.
>
> Thanks,
> Tanya
>
> On 13 October 2016 at 06:26, Sean Beckett  wrote:
>
>> Tanya, what range of time does your data cover? What are the retention
>> policies on the database?
>>
>> On Tue, Oct 11, 2016 at 11:14 PM, Tanya Unterberger <
>> tanya.unterber...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> 1. Initially I killed the process
>>> 2. At some point I restarted influxdb service
>>> 3. Error logs show no errors
>>> 4. I rebuilt the server, installed the latest rpm. Reimported the data
>>> via scripts. Data goes in, but the server is unusable. Looks like indexing
>>> might be stuffed. The size of the data in that database is 38M. Total size
>>> of /var/lib/influxdb/data/ 273M
>>> 5. CPU went berserk and doesn't come down
>>> 6. A query like select count(blah) against the measurement that was batch
>>> inserted (10k records at a time) is unusable and times out
>>> 7. I need to import around 15 million records. How should I throttle
>>> that?
>>>
>>> At the moment I am pulling my hair out (not a pretty sight)
>>>
>>> Thanks a lot!
>>> Tanya
>>>
>>> On 12 October 2016 at 06:11, Sean Beckett  wrote:
>>>


 On Tue, Oct 11, 2016 at 12:11 AM,  wrote:

> Hi,
>
> It seems that the old issue might have surfaced again (#3349) in v1.0.
>
> I tried to insert a large number of records (3913595) via a script,
> inserting 10,000 rows at a time.
>
> After a while I received
>
> HTTP/1.1 500 Internal Server Error
> Content-Type: application/json
> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
> X-Influxdb-Version: 1.0.0
> Date: Tue, 11 Oct 2016 05:12:02 GMT
> Content-Length: 20
>
> {"error":"timeout"}
> HTTP/1.1 100 Continue
>
> I killed the process, after which the whole box became pretty much
> unresponsive.
>

 Killed the InfluxDB process, or the batch writing script process?


>
> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
> nothing) although the setting for http logging is true:
>

 systemd OSes put the logs in a new place (yay!?). See
 http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd
 for how to read the logs.


>
> [http]
>   enabled = true
>   bind-address = ":8086"
>   auth-enabled = true
>   log-enabled = true
>
> I tried to restart influx, but got the following error:
>
> Failed to connect to http://localhost:8086
> Please check your connection settings and ensure 'influxd' is running.
>

 The `influx` console is just a fancy wrapper on the API. That error
 doesn't mean much except that the HTTP listener in InfluxDB is not yet up
 and running.


>
> Although I can see that influxd is up and running:
>
> > systemctl | grep influx
> influxdb.service
> loaded active running   InfluxDB is an open-source,
> distributed, time series database
>
> What do I do now?
>

 Check the logs as referenced above.

 The non-responsiveness on startup isn't surprising. It sounds like the
 system was overwhelmed with writes, which means that the WAL would have
 many points cached, waiting to be flushed to disk. On restart, InfluxDB
 won't accept new writes or queries until the cached ones in the WAL have
 persisted. For this reason, the HTTP listener is off until the WAL is
 flushed.


>
> I tried the same import over the weekend, then the script timeout
> happened eventually but the result was the same unresponsive, unusable
> server. We rebuilt the box and started again.
>

 It sounds like the box is just overwhelmed. Did you get backoff
 messages from the writes before the crash? 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-12 Thread Tanya Unterberger
Hi Sean,

The data is from 1838 to 2016, daily (sparse at times). We need to retain
it, therefore the default policy.

Thanks,
Tanya

On 13 October 2016 at 06:26, Sean Beckett  wrote:

> Tanya, what range of time does your data cover? What are the retention
> policies on the database?
>
> On Tue, Oct 11, 2016 at 11:14 PM, Tanya Unterberger <
> tanya.unterber...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> 1. Initially I killed the process
>> 2. At some point I restarted influxdb service
>> 3. Error logs show no errors
>> 4. I rebuilt the server, installed the latest rpm. Reimported the data
>> via scripts. Data goes in, but the server is unusable. Looks like indexing
>> might be stuffed. The size of the data in that database is 38M. Total size
>> of /var/lib/influxdb/data/ 273M
>> 5. CPU went berserk and doesn't come down
>> 6. A query like select count(blah) against the measurement that was batch
>> inserted (10k records at a time) is unusable and times out
>> 7. I need to import around 15 million records. How should I throttle that?
>>
>> At the moment I am pulling my hair out (not a pretty sight)
>>
>> Thanks a lot!
>> Tanya
>>
>> On 12 October 2016 at 06:11, Sean Beckett  wrote:
>>
>>>
>>>
>>> On Tue, Oct 11, 2016 at 12:11 AM,  wrote:
>>>
 Hi,

 It seems that the old issue might have surfaced again (#3349) in v1.0.

 I tried to insert a large number of records (3913595) via a script,
 inserting 10,000 rows at a time.

 After a while I received

 HTTP/1.1 500 Internal Server Error
 Content-Type: application/json
 Request-Id: ac8ebbbe-8f70-11e6-8ce7-
 X-Influxdb-Version: 1.0.0
 Date: Tue, 11 Oct 2016 05:12:02 GMT
 Content-Length: 20

 {"error":"timeout"}
 HTTP/1.1 100 Continue

 I killed the process, after which the whole box became pretty much
 unresponsive.

>>>
>>> Killed the InfluxDB process, or the batch writing script process?
>>>
>>>

 There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
 nothing) although the setting for http logging is true:

>>>
>>> systemd OSes put the logs in a new place (yay!?). See
>>> http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd
>>> for how to read the logs.
>>>
>>>

 [http]
   enabled = true
   bind-address = ":8086"
   auth-enabled = true
   log-enabled = true

 I tried to restart influx, but got the following error:

 Failed to connect to http://localhost:8086
 Please check your connection settings and ensure 'influxd' is running.

>>>
>>> The `influx` console is just a fancy wrapper on the API. That error
>>> doesn't mean much except that the HTTP listener in InfluxDB is not yet up
>>> and running.
>>>
>>>

 Although I can see that influxd is up and running:

 > systemctl | grep influx
 influxdb.service
   loaded active running   InfluxDB is an open-source,
 distributed, time series database

 What do I do now?

>>>
>>> Check the logs as referenced above.
>>>
>>> The non-responsiveness on startup isn't surprising. It sounds like the
>>> system was overwhelmed with writes, which means that the WAL would have
>>> many points cached, waiting to be flushed to disk. On restart, InfluxDB
>>> won't accept new writes or queries until the cached ones in the WAL have
>>> persisted. For this reason, the HTTP listener is off until the WAL is
>>> flushed.
>>>
>>>

 I tried the same import over the weekend, then the script timeout
 happened eventually but the result was the same unresponsive, unusable
 server. We rebuilt the box and started again.

>>>
>>> It sounds like the box is just overwhelmed. Did you get backoff messages
>>> from the writes before the crash? What are the machine specs?
>>>
>>>
>>>

 Perhaps it is worthwhile mentioning that the same measurement already
 contained about 9 million records. Some of these records had the same
 timestamp as the ones I tried to import, i.e. they should have been merged.

>>>
>>> Overwriting points is much much more expensive than posting new points.
>>> Each overwritten point triggers a tombstone record which must later be
>>> processed. This can trigger frequent compactions of the TSM files. With a
>>> high write load and frequent compactions, the system would encounter
>>> significant CPU pressure.
>>>
>>>

 Interestingly enough the same amount of data was fine when I forgot to
 add precision in ms, i.e. all records were imported as nanoseconds, but in
 fact they "lacked" 6 zeroes.

>>>
>>> That would mean all points are going to the same shard. It is more
>>> resource intensive to load points across a wide range of time, since more
>>> shard files are involved. InfluxDB does best with sequential
>>> chronologically ordered unique points from the very recent 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-12 Thread Sean Beckett
Tanya, what range of time does your data cover? What are the retention
policies on the database?
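
A quick way to check is SHOW RETENTION POLICIES; "mydb" below is just a
placeholder for your database name. With a stock 1.0 setup the output looks
roughly like this, and shardGroupDuration is the column of interest:

> show retention policies on mydb
name    duration  shardGroupDuration  replicaN  default
autogen 0s        168h0m0s            1         true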

On Tue, Oct 11, 2016 at 11:14 PM, Tanya Unterberger <
tanya.unterber...@gmail.com> wrote:

> Hi Sean,
>
> 1. Initially I killed the process
> 2. At some point I restarted influxdb service
> 3. Error logs show no errors
> 4. I rebuilt the server, installed the latest rpm. Reimported the data via
> scripts. Data goes in, but the server is unusable. Looks like indexing
> might be stuffed. The size of the data in that database is 38M. Total size
> of /var/lib/influxdb/data/ 273M
> 5. CPU went berserk and doesn't come down
> 6. A query like select count(blah) against the measurement that was batch
> inserted (10k records at a time) is unusable and times out
> 7. I need to import around 15 million records. How should I throttle that?
>
> At the moment I am pulling my hair out (not a pretty sight)
>
> Thanks a lot!
> Tanya
>
> On 12 October 2016 at 06:11, Sean Beckett  wrote:
>
>>
>>
>> On Tue, Oct 11, 2016 at 12:11 AM,  wrote:
>>
>>> Hi,
>>>
>>> It seems that the old issue might have surfaced again (#3349) in v1.0.
>>>
>>> I tried to insert a large number of records (3913595) via a script,
>>> inserting 10,000 rows at a time.
>>>
>>> After a while I received
>>>
>>> HTTP/1.1 500 Internal Server Error
>>> Content-Type: application/json
>>> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
>>> X-Influxdb-Version: 1.0.0
>>> Date: Tue, 11 Oct 2016 05:12:02 GMT
>>> Content-Length: 20
>>>
>>> {"error":"timeout"}
>>> HTTP/1.1 100 Continue
>>>
>>> I killed the process, after which the whole box became pretty much
>>> unresponsive.
>>>
>>
>> Killed the InfluxDB process, or the batch writing script process?
>>
>>
>>>
>>> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
>>> nothing) although the setting for http logging is true:
>>>
>>
>> systemd OSes put the logs in a new place (yay!?). See
>> http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd
>> for how to read the logs.
>>
>>
>>>
>>> [http]
>>>   enabled = true
>>>   bind-address = ":8086"
>>>   auth-enabled = true
>>>   log-enabled = true
>>>
>>> I tried to restart influx, but got the following error:
>>>
>>> Failed to connect to http://localhost:8086
>>> Please check your connection settings and ensure 'influxd' is running.
>>>
>>
>> The `influx` console is just a fancy wrapper on the API. That error
>> doesn't mean much except that the HTTP listener in InfluxDB is not yet up
>> and running.
>>
>>
>>>
>>> Although I can see that influxd is up and running:
>>>
>>> > systemctl | grep influx
>>> influxdb.service
>>>   loaded active running   InfluxDB is an open-source,
>>> distributed, time series database
>>>
>>> What do I do now?
>>>
>>
>> Check the logs as referenced above.
>>
>> The non-responsiveness on startup isn't surprising. It sounds like the
>> system was overwhelmed with writes, which means that the WAL would have
>> many points cached, waiting to be flushed to disk. On restart, InfluxDB
>> won't accept new writes or queries until the cached ones in the WAL have
>> persisted. For this reason, the HTTP listener is off until the WAL is
>> flushed.
>>
>>
>>>
>>> I tried the same import over the weekend, then the script timeout
>>> happened eventually but the result was the same unresponsive, unusable
>>> server. We rebuilt the box and started again.
>>>
>>
>> It sounds like the box is just overwhelmed. Did you get backoff messages
>> from the writes before the crash? What are the machine specs?
>>
>>
>>
>>>
>>> Perhaps it is worthwhile mentioning that the same measurement already
>>> contained about 9 million records. Some of these records had the same
>>> timestamp as the ones I tried to import, i.e. they should have been merged.
>>>
>>
>> Overwriting points is much much more expensive than posting new points.
>> Each overwritten point triggers a tombstone record which must later be
>> processed. This can trigger frequent compactions of the TSM files. With a
>> high write load and frequent compactions, the system would encounter
>> significant CPU pressure.
>>
>>
>>>
>>> Interestingly enough the same amount of data was fine when I forgot to
>>> add precision in ms, i.e. all records were imported as nanoseconds, but in
>>> fact they "lacked" 6 zeroes.
>>>
>>
>> That would mean all points are going to the same shard. It is more
>> resource intensive to load points across a wide range of time, since more
>> shard files are involved. InfluxDB does best with sequential
>> chronologically ordered unique points from the very recent past. The more
>> the write operation differs from that, the lower the throughput.
>>
>>
>>>
>>> Please advise what kind of action I can take.
>>>
>>
>> Look in the logs for errors. Throttle the writes. Don't overwrite more
>> points than you have to.
>>
>>
>>>
>>> Thanks a lot!
>>> Tanya
>>>
>>> --
>>> Remember to include the InfluxDB version 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-11 Thread Tanya Unterberger
Hi Sean,

1. Initially I killed the process
2. At some point I restarted influxdb service
3. Error logs show no errors
4. I rebuilt the server, installed the latest rpm. Reimported the data via
scripts. Data goes in, but the server is unusable. Looks like indexing
might be stuffed. The size of the data in that database is 38M. Total size
of /var/lib/influxdb/data/ 273M
5. CPU went berserk and doesn't come down
6. A query like select count(blah) against the measurement that was batch
inserted (10k records at a time) is unusable and times out
7. I need to import around 15 million records. How should I throttle that?

At the moment I am pulling my hair out (not a pretty sight)

Thanks a lot!
Tanya

On 12 October 2016 at 06:11, Sean Beckett  wrote:

>
>
> On Tue, Oct 11, 2016 at 12:11 AM,  wrote:
>
>> Hi,
>>
>> It seems that the old issue might have surfaced again (#3349) in v1.0.
>>
>> I tried to insert a large number of records (3913595) via a script,
>> inserting 10,000 rows at a time.
>>
>> After a while I received
>>
>> HTTP/1.1 500 Internal Server Error
>> Content-Type: application/json
>> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
>> X-Influxdb-Version: 1.0.0
>> Date: Tue, 11 Oct 2016 05:12:02 GMT
>> Content-Length: 20
>>
>> {"error":"timeout"}
>> HTTP/1.1 100 Continue
>>
>> I killed the process, after which the whole box became pretty much
>> unresponsive.
>>
>
> Killed the InfluxDB process, or the batch writing script process?
>
>
>>
>> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
>> nothing) although the setting for http logging is true:
>>
>
> systemd OSes put the logs in a new place (yay!?). See
> http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd for
> how to read the logs.
>
>
>>
>> [http]
>>   enabled = true
>>   bind-address = ":8086"
>>   auth-enabled = true
>>   log-enabled = true
>>
>> I tried to restart influx, but got the following error:
>>
>> Failed to connect to http://localhost:8086
>> Please check your connection settings and ensure 'influxd' is running.
>>
>
> The `influx` console is just a fancy wrapper on the API. That error
> doesn't mean much except that the HTTP listener in InfluxDB is not yet up
> and running.
>
>
>>
>> Although I can see that influxd is up and running:
>>
>> > systemctl | grep influx
>> influxdb.service
>> loaded active running   InfluxDB is an open-source,
>> distributed, time series database
>>
>> What do I do now?
>>
>
> Check the logs as referenced above.
>
> The non-responsiveness on startup isn't surprising. It sounds like the
> system was overwhelmed with writes, which means that the WAL would have
> many points cached, waiting to be flushed to disk. On restart, InfluxDB
> won't accept new writes or queries until the cached ones in the WAL have
> persisted. For this reason, the HTTP listener is off until the WAL is
> flushed.
>
>
>>
>> I tried the same import over the weekend, then the script timeout
>> happened eventually but the result was the same unresponsive, unusable
>> server. We rebuilt the box and started again.
>>
>
> It sounds like the box is just overwhelmed. Did you get backoff messages
> from the writes before the crash? What are the machine specs?
>
>
>
>>
>> Perhaps it is worthwhile mentioning that the same measurement already
>> contained about 9 million records. Some of these records had the same
>> timestamp as the ones I tried to import, i.e. they should have been merged.
>>
>
> Overwriting points is much much more expensive than posting new points.
> Each overwritten point triggers a tombstone record which must later be
> processed. This can trigger frequent compactions of the TSM files. With a
> high write load and frequent compactions, the system would encounter
> significant CPU pressure.
>
>
>>
>> Interestingly enough the same amount of data was fine when I forgot to
>> add precision in ms, i.e. all records were imported as nanoseconds, but in
>> fact they "lacked" 6 zeroes.
>>
>
> That would mean all points are going to the same shard. It is more
> resource intensive to load points across a wide range of time, since more
> shard files are involved. InfluxDB does best with sequential
> chronologically ordered unique points from the very recent past. The more
> the write operation differs from that, the lower the throughput.
>
>
>>
>> Please advise what kind of action I can take.
>>
>
> Look in the logs for errors. Throttle the writes. Don't overwrite more
> points than you have to.
>
>
>>
>> Thanks a lot!
>> Tanya
>>
>> --
>> Remember to include the InfluxDB version number with all issue reports
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "InfluxDB" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to influxdb+unsubscr...@googlegroups.com.
>> To post to this group, send email to influxdb@googlegroups.com.
>> Visit this group at 

Re: [influxdb] Internal server error, timeout and unusable server after large imports

2016-10-11 Thread Sean Beckett
On Tue, Oct 11, 2016 at 12:11 AM,  wrote:

> Hi,
>
> It seems that the old issue might have surfaced again (#3349) in v1.0.
>
> I tried to insert a large number of records (3913595) via a script,
> inserting 10,000 rows at a time.
>
> After a while I received
>
> HTTP/1.1 500 Internal Server Error
> Content-Type: application/json
> Request-Id: ac8ebbbe-8f70-11e6-8ce7-
> X-Influxdb-Version: 1.0.0
> Date: Tue, 11 Oct 2016 05:12:02 GMT
> Content-Length: 20
>
> {"error":"timeout"}
> HTTP/1.1 100 Continue
>
> I killed the process, after which the whole box became pretty much
> unresponsive.
>

Killed the InfluxDB process, or the batch writing script process?


>
> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
> nothing) although the setting for http logging is true:
>

systemd OSes put the logs in a new place (yay!?). See
http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd for
how to read the logs.
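
In short, something along the lines of:

sudo journalctl -u influxdb.service
sudo journalctl -u influxdb.service -f   # follow the log live

assuming the service is registered as influxdb.service, which matches the
systemctl output below.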


>
> [http]
>   enabled = true
>   bind-address = ":8086"
>   auth-enabled = true
>   log-enabled = true
>
> I tried to restart influx, but got the following error:
>
> Failed to connect to http://localhost:8086
> Please check your connection settings and ensure 'influxd' is running.
>

The `influx` console is just a fancy wrapper on the API. That error doesn't
mean much except that the HTTP listener in InfluxDB is not yet up and
running.
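
(One way to check the listener directly is the ping endpoint, e.g.

curl -i http://localhost:8086/ping

which returns 204 No Content once the HTTP service is accepting requests.)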


>
> Although I can see that influxd is up and running:
>
> > systemctl | grep influx
> influxdb.service
> loaded active running   InfluxDB is an open-source,
> distributed, time series database
>
> What do I do now?
>

Check the logs as referenced above.

The non-responsiveness on startup isn't surprising. It sounds like the
system was overwhelmed with writes, which means that the WAL would have
many points cached, waiting to be flushed to disk. On restart, InfluxDB
won't accept new writes or queries until the cached ones in the WAL have
persisted. For this reason, the HTTP listener is off until the WAL is
flushed.


>
> I tried the same import over the weekend, then the script timeout happened
> eventually but the result was the same unresponsive, unusable server. We
> rebuilt the box and started again.
>

It sounds like the box is just overwhelmed. Did you get backoff messages
from the writes before the crash? What are the machine specs?



>
> Perhaps it is worthwhile mentioning that the same measurement already
> contained about 9 million records. Some of these records had the same
> timestamp as the ones I tried to import, i.e. they should have been merged.
>

Overwriting points is much much more expensive than posting new points.
Each overwritten point triggers a tombstone record which must later be
processed. This can trigger frequent compactions of the TSM files. With a
high write load and frequent compactions, the system would encounter
significant CPU pressure.


>
> Interestingly enough the same amount of data was fine when I forgot to add
> precision in ms, i.e. all records were imported as nanoseconds, but in fact
> they "lacked" 6 zeroes.
>

That would mean all points are going to the same shard. It is more resource
intensive to load points across a wide range of time, since more shard
files are involved. InfluxDB does best with sequential chronologically
ordered unique points from the very recent past. The more the write
operation differs from that, the lower the throughput.


>
> Please advise what kind of action I can take.
>

Look in the logs for errors. Throttle the writes. Don't overwrite more
points than you have to.
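
For the throttling, a rough sketch of one approach in shell; the file name,
database name, credentials, batch size and pause are all placeholders to
tune, not recommended values:

# split the line-protocol file into 5,000-line chunks
split -l 5000 points.txt batch_

for f in batch_*; do
  curl -s -XPOST 'http://localhost:8086/write?db=mydb&precision=ms' \
    -u user:password \
    --data-binary @"$f"
  sleep 1   # give the WAL and compactions a moment to catch up
done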


>
> Thanks a lot!
> Tanya
>
> --
> Remember to include the InfluxDB version number with all issue reports
> ---
> You received this message because you are subscribed to the Google Groups
> "InfluxDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to influxdb+unsubscr...@googlegroups.com.
> To post to this group, send email to influxdb@googlegroups.com.
> Visit this group at https://groups.google.com/group/influxdb.
> To view this discussion on the web visit https://groups.google.com/d/ms
> gid/influxdb/f4ebdb56-32f9-4fb6-88de-f7ef603c4262%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
Sean Beckett
Director of Support and Professional Services
InfluxDB

-- 
Remember to include the version number!
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxData" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to influxdb+unsubscr...@googlegroups.com.
To post to this group, send email to influxdb@googlegroups.com.
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/CALGqCvMCu%3DM9eR5NOky-LRAiqRU5cnCDJa0SBjRrz5_Wt0tT8g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.