Monitor Data Retention
Overview
The monitoring data exposed through the UniCluster Express Monitor Console is stored in two repositories. Cluster host metrics are stored in a series of Ganglia-managed Round-Robin Databases (RRDs), while Grid Engine job information is stored in the ARCo PostgreSQL database. A default installation of UniCluster Express uses Ganglia's default data retention policy for the RRD data and the default ARCo data retention policy for the ARCo data.
Ganglia and RRD
The default Ganglia RRD configuration is adequate for many uses. The default aging policy can be seen by looking at <unicluster-express-home>/etc/gmetad.conf (on a main node):
#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
Each "RRA" string defines a Round-Robin Archive. The values that affect the quantity and precision of the data are the last two numbers: the step count and the row count. The step count is how many "base" datapoints are averaged into a single "archived" datapoint. The row count is the number of archived datapoints kept at that base-to-archived ratio. To work out how long data for a given metric is kept (and at what precision), look at <unicluster-express-home>/etc/gmond.conf (on any cluster node) to find the collection interval, or base step, for that metric. For example, the "load" metrics in gmond.conf are configured like this:
collection_group {
collect_every = 20
time_threshold = 90
/* Load Averages */
metric {
name = "load_one"
value_threshold = "1.0"
}
metric {
name = "load_five"
value_threshold = "1.0"
}
metric {
name = "load_fifteen"
value_threshold = "1.0"
}
}
The "collect_every = 20" setting means that these metrics are updated every 20 seconds, giving them a 20-second base step. Combining the values from gmetad.conf and gmond.conf shows how long the "load_one" data is kept:
- The last 244 samples of the 1-step averaged value. This is the last 1 hour and 21 minutes of data at 20-second intervals ( (244 * 1 * 20) / 60 ~= 81 minutes)
- The last 244 samples of the 24-step averaged value. This is the last 32 hours and 32 minutes of data at 8-minute intervals ( (244 * 24 * 20) / 60 = 1,952 minutes)
- The last 244 samples of the 168-step averaged value. This is about the last 9 days and 12 hours of data at 56-minute intervals ( (244 * 168 * 20) / 60 = 13,664 minutes)
- The last 244 samples of the 672-step averaged value. This is about the last 38 days of data at 224-minute intervals ( (244 * 672 * 20) / 60 = 54,656 minutes)
- The last 374 samples of the 5760-step averaged value. This is about the last 499 days of data at 32-hour intervals ( (374 * 5760 * 20) / 60 = 718,080 minutes)
Any data more than about 499 days old is dropped from the RRD database in this example. Other metrics may have different collection intervals, so they may stay in the database for a longer or shorter time than "load_one". With this default Ganglia configuration, each host's set of RRD database files (one for each metric) takes only 456 KB.
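The arithmetic above can be reproduced with a short script. This is only a sketch: the helper names (parse_rra, resolution_seconds, retention_seconds) are made up for illustration, and the 20-second base step follows this page's assumption that the step comes from "collect_every" in gmond.conf. Each archive is evaluated with its own row count.

```python
from dataclasses import dataclass


@dataclass
class Archive:
    steps: int  # base datapoints averaged into one archived row
    rows: int   # number of archived rows kept


def parse_rra(spec: str) -> Archive:
    """Parse a string such as 'RRA:AVERAGE:0.5:24:244' from the RRAs line."""
    kind, _cf, _xff, steps, rows = spec.split(":")
    if kind != "RRA":
        raise ValueError(f"not an RRA definition: {spec!r}")
    return Archive(steps=int(steps), rows=int(rows))


def resolution_seconds(archive: Archive, base_step: int) -> int:
    """Time covered by one archived row."""
    return archive.steps * base_step


def retention_seconds(archive: Archive, base_step: int) -> int:
    """Total time span covered by the whole archive."""
    return archive.steps * archive.rows * base_step


# Default archives as listed in gmetad.conf on this page.
DEFAULT_RRAS = [
    "RRA:AVERAGE:0.5:1:244",
    "RRA:AVERAGE:0.5:24:244",
    "RRA:AVERAGE:0.5:168:244",
    "RRA:AVERAGE:0.5:672:244",
    "RRA:AVERAGE:0.5:5760:374",
]

# Assumption taken from the text: collect_every = 20 for the load metrics.
BASE_STEP = 20

if __name__ == "__main__":
    for spec in DEFAULT_RRAS:
        archive = parse_rra(spec)
        res_min = resolution_seconds(archive, BASE_STEP) / 60
        keep_days = retention_seconds(archive, BASE_STEP) / 86400
        print(f"{spec}: one row per {res_min:.1f} min, kept for {keep_days:.2f} days")
```

Changing BASE_STEP to the "collect_every" value of another collection group gives the retention for that group's metrics.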
To change the data retention policy, add your desired "RRAs" line to <unicluster-express-home>/etc/gmetad.conf. There is one caveat: the new policy only applies to newly created databases, so you must delete your old RRD databases in <unicluster-express-home>/var/lib/ganglia/rrds and restart gmetad. Deleting them discards all previously collected metric history.
#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
RRAs "RRA:AVERAGE:0.5:1:105408"
followed by:
$ /etc/init.d/gmetad stop
$ rm <unicluster-express-home>/var/lib/ganglia/rrds/*
$ /etc/init.d/gmetad start
In this example RRD keeps 105,408 rows of each metric at that metric's collection frequency. Be wary of the impact of such changes: this example increased the size of each host's set of RRD database files to 389 MB.
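When choosing a row count for a custom archive, it can help to work backwards from the retention you want. The following is a minimal sketch; rows_needed and coverage_days are hypothetical helper names, and the base step is again assumed to be the metric's "collect_every" value in seconds.

```python
import math


def rows_needed(retention_days: float, base_step_s: int, steps_per_row: int = 1) -> int:
    """Rows required to cover retention_days at the given resolution."""
    seconds = retention_days * 86400
    return math.ceil(seconds / (base_step_s * steps_per_row))


def coverage_days(rows: int, base_step_s: int, steps_per_row: int = 1) -> float:
    """Days of history covered by an archive with the given row count."""
    return rows * steps_per_row * base_step_s / 86400


if __name__ == "__main__":
    # The single 1-step archive from the example, for a 20-second metric.
    print(coverage_days(105408, 20))
    # Rows needed to keep 30 days of 20-second samples at full resolution.
    print(rows_needed(30, 20))
    # Rows needed to keep a year of 5-minute averages (15 steps of 20 s).
    print(rows_needed(365, 20, steps_per_row=15))
```

Row counts translate directly into file size for every metric on every host, so a coarser archive (a larger step count) is usually the cheaper way to extend history.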
ARCo
The dbwriter component of ARCo uses two major runtime files to define and maintain the database:
- <install-dir>/sge/dbwriter/database/postgres/dbwriter.xml
- <install-dir>/sge/dbwriter/database/postgres/dbdefinition.xml
The dbdefinition.xml file contains the schema for the PostgreSQL database, while dbwriter.xml contains the derived value, statistical value, and delete rules for the database. The delete rules define the data retention policy of the ARCo database.
The default deletion rules are as follows:
<source lang="xml">
<delete scope="host_values" time_range="day" time_amount="7">
  <sub_scope>np_load_avg</sub_scope>
  <sub_scope>cpu</sub_scope>
  <sub_scope>mem_free</sub_scope>
  <sub_scope>virtual_free</sub_scope>
</delete>
<delete scope="host_values" time_range="year" time_amount="2"/>
<delete scope="queue_values" time_range="month" time_amount="1">
  <sub_scope>slots</sub_scope>
  <sub_scope>state</sub_scope>
</delete>
<delete scope="queue_values" time_range="year" time_amount="2"/>
<delete scope="user_values" time_range="year" time_amount="2"/>
<delete scope="group_values" time_range="year" time_amount="2"/>
<delete scope="project_values" time_range="year" time_amount="2"/>
<delete scope="department_values" time_range="year" time_amount="2"/>
<delete scope="job" time_range="year" time_amount="1"/>
<delete scope="job_log" time_range="month" time_amount="1"/>
<delete scope="share_log" time_range="year" time_amount="1"/>
<delete scope="statistic_values" time_range="day" time_amount="2">
  <sub_scope>lines_per_second</sub_scope>
</delete>
<delete scope="statistic_values" time_range="day" time_amount="7">
  <sub_scope>row_count</sub_scope>
</delete>
<delete scope="statistic_values" time_range="year" time_amount="2"/>
</source>
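Each delete rule names a scope and a maximum age, given by time_amount and time_range; records in that scope older than this are removed. Where sub_scope elements are present, the rule is limited to the named variables. Editing the delete rules in this file changes the data retention policy for ARCo. The Monitor Console specifically uses the sge_job and sge_job_usage tables as data sources.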
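To review a retention policy at a glance, the delete rules can be listed with a few lines of Python. This is an illustrative sketch, not part of ARCo: the <rules> wrapper element and the summarize function are made up for the example, and only a subset of the default rules is included. The standard library's xml.etree.ElementTree is used for parsing.

```python
import xml.etree.ElementTree as ET

# A subset of the default delete rules, wrapped in a hypothetical <rules>
# root element so the fragment parses as a single XML document.
RULES = """
<rules>
  <delete scope="host_values" time_range="day" time_amount="7">
    <sub_scope>np_load_avg</sub_scope>
    <sub_scope>cpu</sub_scope>
  </delete>
  <delete scope="host_values" time_range="year" time_amount="2"/>
  <delete scope="job" time_range="year" time_amount="1"/>
  <delete scope="job_log" time_range="month" time_amount="1"/>
</rules>
"""


def summarize(xml_text: str):
    """Return (scope, amount, unit, sub_scopes) for every delete rule."""
    root = ET.fromstring(xml_text)
    rules = []
    for rule in root.iter("delete"):
        subs = [s.text.strip() for s in rule.findall("sub_scope")]
        rules.append(
            (rule.get("scope"), int(rule.get("time_amount")), rule.get("time_range"), subs)
        )
    return rules


if __name__ == "__main__":
    for scope, amount, unit, subs in summarize(RULES):
        target = ", ".join(subs) if subs else "(no sub_scope)"
        print(f"{scope}: delete after {amount} {unit}(s) for {target}")
```

Pointing the same function at the rules copied out of your own configuration gives a quick check that an edit says what you intended before dbwriter acts on it.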
This page [1] contains more information on the deletion policy and provides a table of default values. There is also an OpenOffice spreadsheet at the bottom of that page for estimating database disk usage from various cluster environment values.
This page [2] has some more info on general dbwriter configuration.
