Collins - Concepts

The collins data model

Collins was designed from the beginning to represent assets in the simplest way possible. This simplicity makes for an efficient data-model and allows for a large degree of flexibility. Collins really only knows about a few different kinds of things. Collins has:

Assets
Status & States
Tags
Logs
Addresses

That's it. Everything you need to know about collins has to do with one of these types of things.

An asset in collins describes a thing, usually a piece of hardware or a configuration.

Assets themselves are very simple internally and only have a handful of meaningful attributes. An asset consists of a:

tag: A tag is a unique, immutable, alphanumeric sequence of characters associated with the asset. These are arbitrary but nearly all API calls for interacting with assets will reference this tag.
status: Indicates the current lifecycle phase of the asset. More details in the next section.
state: Indicates a phase (status) specific state of the asset. More details in the next section.
type: Current supported types include: Configuration, Data Center, Power Circuit, Power Strip, Rack, Router, Server Node and Switch. Of primary interest to most users will be server nodes and configurations.

New asset types are fairly easy to create but cannot be created via the API (although this is on the roadmap). Status values are not generally added or changed; typically a new state is used instead.
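The four attributes above can be pictured as a small record. This is only an illustrative sketch, not how Collins stores assets: the class name `Asset`, the field name `asset_type` and the validation are assumptions made for the example.

```python
from dataclasses import dataclass

# Asset types listed in the text above.
ASSET_TYPES = frozenset({
    "Configuration", "Data Center", "Power Circuit", "Power Strip",
    "Rack", "Router", "Server Node", "Switch",
})


@dataclass
class Asset:
    """Illustrative model only; not the Collins implementation."""
    tag: str         # unique identifier referenced by nearly all API calls
    status: str      # lifecycle phase, e.g. "Allocated"
    state: str       # status-specific state, e.g. "RUNNING"
    asset_type: str  # one of ASSET_TYPES

    def __post_init__(self):
        # Reject types that are not in the supported list.
        if self.asset_type not in ASSET_TYPES:
            raise ValueError(f"unknown asset type: {self.asset_type!r}")


web = Asset(tag="tumblrtag1", status="Allocated", state="RUNNING",
            asset_type="Server Node")
```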

Multi-dimensional key/values for an asset

The most fundamentally important concept in collins is that of a tag. A tag consists of a key (such as TOTAL_DISK_SPACE_BYTES), a value (such as 536870912000) and a dimension (such as 0). Tags are what provide most of the flexibility within collins. If you set a new tag on an asset and that tag does not yet exist, it will be created and immediately available to query on. If you set a new tag value using the same dimension, the old value will be overwritten. All tag changes, including deletes, are logged as part of the asset audit trail.
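The overwrite and audit behaviour described above can be modelled with a toy in-memory store. This is a sketch under assumptions, not Collins code: the `TagStore` class, its method names and the default dimension of 0 are all hypothetical.

```python
class TagStore:
    """Toy model of the tags on a single asset (not Collins code)."""

    def __init__(self):
        self._values = {}    # (key, dimension) -> value
        self.audit_log = []  # every change, including deletes

    def set(self, key, value, dimension=0):
        # Setting the same key and dimension overwrites the old value.
        old = self._values.get((key, dimension))
        self._values[(key, dimension)] = value
        self.audit_log.append(("set", key, dimension, old, value))

    def delete(self, key, dimension=0):
        old = self._values.pop((key, dimension), None)
        self.audit_log.append(("delete", key, dimension, old, None))

    def get(self, key, dimension=0):
        return self._values.get((key, dimension))


store = TagStore()
store.set("TOTAL_DISK_SPACE_BYTES", 536870912000)
store.set("TOTAL_DISK_SPACE_BYTES", 0, dimension=1)      # new dimension, no overwrite
store.set("TOTAL_DISK_SPACE_BYTES", 1073741824000)       # overwrites dimension 0
```

Keying on the pair (key, dimension) is what lets the same tag appear several times on one asset without the values colliding.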

Tag Types

Tags can be either managed, automated, or unmanaged.

Managed: Tags that are not user manageable (such as CPU speed) and can only be set or changed during certain lifecycle events, such as the server intake process. These tags cannot be changed, updated or deleted by a user. This is enforced in the software.
Automated: Tags that are managed by external automated processes such as server provisioning. Although these can technically be set/changed by users via the API, convention dictates that they should not be.
Unmanaged: Tags set by a user, most likely via some API interaction, and probably on a manual basis. In some cases these may be part of an automated process, but some user interaction will dictate the tag value. These are typically used for things like notes, links, testing, etc.

Dimensions

A dimension can be any integer value (including a timestamp) and is typically used for grouping things together. You can imagine the disks of a host being described like:

{
    "0": {
      "DEV_NAME": "/dev/sda",
      "TOTAL_DISK_SPACE_BYTES": 536870912000,
      "DISK_TYPE": "SATA"
    },
    "1": {
      "DEV_NAME": "/dev/sdb",
      "TOTAL_DISK_SPACE_BYTES": 0,
      "DISK_TYPE": "CD-ROM"
    }
}

This information tells you that the asset has two disks (one sata, one cd-rom) and tells you a bit about each physical disk. The value of the dimension in this case is just to logically group things together, and possibly to provide some ordering. You could also however use the dimension to represent things like port allocation (i.e. each dimension represents a port in use on the asset), or order of some arbitrary event where the timestamp is the dimension used. It's up to you.
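To make the grouping concrete, here is a minimal sketch that takes flat (key, dimension, value) triples and rebuilds the per-disk view shown above. The triple layout and the function name `group_by_dimension` are assumptions for illustration; Collins may represent tags differently internally.

```python
from collections import defaultdict

# Flat tags as (key, dimension, value) triples, mirroring the disk example.
flat_tags = [
    ("DEV_NAME", 0, "/dev/sda"),
    ("TOTAL_DISK_SPACE_BYTES", 0, 536870912000),
    ("DISK_TYPE", 0, "SATA"),
    ("DEV_NAME", 1, "/dev/sdb"),
    ("TOTAL_DISK_SPACE_BYTES", 1, 0),
    ("DISK_TYPE", 1, "CD-ROM"),
]


def group_by_dimension(tags):
    """Group tags into {dimension: {key: value}}, ordered by dimension."""
    grouped = defaultdict(dict)
    for key, dimension, value in tags:
        grouped[dimension][key] = value
    return {dimension: grouped[dimension] for dimension in sorted(grouped)}


disks = group_by_dimension(flat_tags)
```

Sorting by dimension gives the ordering mentioned above for free, which also works when the dimension is a timestamp.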

Tags

The flexible nature of the tag approach used with Collins dictates that agreed upon convention is required.

Managed Tags

Most hardware information is gleaned from LSHW and most networking information is taken from LLDP. The usefulness of the hardware information is dependent on the version of LSHW (recent versions get better information). The LLDP information in SoftLayer is entirely faked by converting the output from `ifconfig` to an LLDP compatible format.

Managed tags automatically use an additional dimension if required. For example, say you have two memory sticks and we're tracking a description and size for each stick. Collins will create two 'groups', one that represents each memory stick. This is how collins differentiates between the same tag being used multiple times for the same asset.

General Information

All of these are specified as HTTP parameters during the intake process. These are not inferred, but specified.

Tag	Type	Description
SERVICE_TAG	String	Vendor service tag (e.g. dell tag)
CHASSIS_TAG	String	Manufacturer serial number
RACK_POSITION	String	Physical position in the rack
POWER_PORT_A	String	The port in the CDU that the asset is attached to. The name and value will depend on your power configuration (detailed in configuration)

CPU Information

Collins assumes that each CPU is identical, so it uses only one dimension to represent all CPUs. Collins creates the visual hardware representation based on the CPU_COUNT. This information is taken from LSHW.

Tag	Type	Description
CPU_COUNT	Integer	Physical CPU count in machine
CPU_CORES	Integer	Number of cores per physical CPU
CPU_THREADS	Integer	Number of threads per core
CPU_SPEED_GHZ	Float	Speed of CPU
CPU_DESCRIPTION	String	Vendor description or label

Memory Information

Collins represents a stick of RAM using the dimension (dictates bank used), MEMORY_SIZE_BYTES and MEMORY_DESCRIPTION. Each dimension represents a stick of RAM present in an asset. Collins creates the visual representation of the RAM layout by using the MEMORY_BANKS_TOTAL and filling in the sticks that are missing. No additional data is stored on missing RAM. This information is taken from LSHW.

Tag	Type	Description
MEMORY_SIZE_BYTES	Long	Size of single RAM module
MEMORY_DESCRIPTION	String	Vendor description or label
MEMORY_SIZE_TOTAL	Long	Total number of bytes (`MEMORY_SIZE_BYTES` * number of modules)
MEMORY_BANKS_TOTAL	Integer	Total number of RAM banks
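The way the RAM layout is filled in can be sketched as follows. This is a guess at the idea rather than the Collins rendering code: it assumes the dimension is a zero-based bank index, and the function names are made up for the example.

```python
def memory_layout(sticks, banks_total):
    """Return one entry per bank: the stick size in bytes, or None if empty.

    sticks maps a dimension (assumed to be a zero-based bank index)
    to that stick's MEMORY_SIZE_BYTES.
    """
    return [sticks.get(bank) for bank in range(banks_total)]


def memory_size_total(sticks):
    """Sum of all stick sizes; equals size * module count for identical modules."""
    return sum(sticks.values())


GIB = 2 ** 30
sticks = {0: 8 * GIB, 2: 8 * GIB}  # banks 1 and 3 are empty
layout = memory_layout(sticks, banks_total=4)
total = memory_size_total(sticks)
```

Empty banks show up as `None` because, as noted above, nothing is stored for missing RAM.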

Network Information

Collins represents a NIC with its NIC_SPEED, MAC_ADDRESS and NIC_DESCRIPTION. Each dimension represents a different physical network card. This information is taken from LSHW.

Tag	Type	Description
NIC_SPEED	Long	Speed in bits per second
MAC_ADDRESS	String	MAC address of network interface
NIC_DESCRIPTION	String	Vendor description or label

Disk Information

Collins represents a disk with its DISK_SIZE_BYTES, DISK_TYPE and DISK_DESCRIPTION. Each dimension represents a different physical disk. This information is taken from LSHW.

Tag	Type	Description
DISK_SIZE_BYTES	Long	Size in bytes of disk. This will occasionally report as 0 when the disk is part of a RAID group, or in the case of a CD-ROM. The number of physical disks and storage total will still be accurate in this case.
DISK_TYPE	Enum	Type of disk. IDE, SCSI, PCIe, or CD-ROM.
DISK_DESCRIPTION	String	Vendor description or label
DISK_STORAGE_TOTAL	String	Sum of size for all disks. Because this number is often larger than a Long, it is stored as a String

LLDP Information

Collins represents each interface connected to an LLDP capable device using all of the information below. Each dimension represents a different physically connected network interface. Each interface may have zero or more VLANs. This information is taken from LLDP.

LLDP_INTERFACE_NAME
LLDP_CHASSIS_NAME
LLDP_CHASSIS_ID_TYPE
LLDP_CHASSIS_ID_VALUE
LLDP_CHASSIS_DESCRIPTION
LLDP_PORT_ID_TYPE
LLDP_PORT_ID_VALUE
LLDP_PORT_DESCRIPTION
LLDP_VLAN_ID
LLDP_VLAN_NAME

Automated Tags

This data should be populated by largely automated processes. The list below is going to be fairly Tumblr specific with some exceptions. I have tried to denote tags that are likely Tumblr specific. If you use Collins as a fact/node terminus with puppet (or whatever system you use), this data can be useful for driving automation. I have tried to denote what Tumblr processes populate this data, but this is likely not very meaningful outside of Tumblr.

General Automated Tags

Automated tags that are not Tumblr specific, although some of the tools that populate them may be.

Tag	Type	Populated By	Description
HOSTNAME	String	visioner	Populated when an asset successfully starts the provisioning process
NODECLASS	String	visioner	Populated when an asset successfully starts the provisioning process. This is used by the puppet master to resolve the nodeclass for a host.
CONTACT	String	collins	When a user provisions an asset the specified `CONTACT` is populated. This is used for notification of maintenance, provisioning being done, etc.
PRIMARY_ROLE	String	collins	When a user provisions an asset the chosen `PRIMARY_ROLE` is populated. This is used to communicate the intended purpose of the asset (CACHE, DATABASE, HADOOP, PROXY, WEB, etc)
POOL	String	collins	Populated along with `PRIMARY_ROLE`, this tag is used to communicate the functional pool the asset should participate in, e.g. MEMCACHE_POST_POOL
SECONDARY_ROLE	String	collins	Populated along with `PRIMARY_ROLE`, this tag is used to communicate some secondary role such as MASTER, SLAVE, or a deployment group (such as WEB_A, WEB_B).
SYSTEM_USERNAME	String	visioner	Set by visioner when a host is provisioned. In SoftLayer, reconciler populates this data as we pull it from the SL API. Typically 'root'.
SYSTEM_PASSWORD	String	visioner	Set by visioner when a host is provisioned. In SoftLayer, reconciler populates this data as we pull it from the SL API. Typically 'root'.
DATACENTER_NAME	Enum	Invisible Touch	Set by invisible touch when a host goes through intake. Largely unused since we use multicollins deployments.
SYSTEM_COST	Integer	reconciler	Populated by reconciler in SoftLayer, populated by hand in EWR
CANCEL_TICKET	Long	collins	If the SoftLayer plugin is configured and an asset is cancelled, this value is populated

Tumblr Automated Tags

Tumblr specific automated tags

Tag	Type	Populated By	Description
POWER_STATUS	String	decommissioner	Tracks decommission state as an asset is wiped before being returned
{APP}_SHA	String	deploytool	The deploytool updates the SHA on the asset when an application (tumblr, config, etc) is deployed to it. Only assets with the current SHA take production traffic.

Status and state describe the current phase of the lifecycle of an asset.

Status

The lifecycle (from birth to death) of an asset is described in terms of its status. The possible status values are fixed and cannot be managed via the API. While all available status values are listed below, the descriptions given are primarily indicative of their meanings for a server. A non-server asset type such as a configuration may only ever be allocated or decommissioned, for instance. The status values described below also give some specific insight into the Tumblr intake process for hardware.

Incomplete: Host not yet ready for use. It has been powered on and entered in Collins but burn-in is likely being run
New: Host has completed the burn-in process and is waiting for an onsite tech to complete physical intake
Unallocated: Host has completed intake process and is ready for use
Provisioning: Host has started provisioning process but has not yet completed it
Provisioned: Host has finished provisioning and is awaiting final automated verification
Allocated: This asset is in what should likely be considered a production state
Cancelled: Asset is no longer needed and is awaiting decommissioning
Decommissioned: Asset has completed the outtake process and can no longer be managed
Maintenance: Asset is undergoing some kind of maintenance and should not be considered for production use

Status transitions should generally not happen by hand. Automated processes should drive status changes, not people. In fact, the collins UI only allows you to change the status by taking an action (e.g. cancelling an asset or putting it into maintenance).

State

While the status of an asset describes where it is in a discrete lifecycle, the state describes a lifecycle specific to a status. For example, a server that is in maintenance may have a state of HARDWARE_PROBLEM or HARDWARE_UPGRADE. Also note that those states are not appropriate for healthy (non-maintenance) assets, and so these states are restricted to assets with a status of Maintenance.

A state can be either a system state, or a non-system state. System states can not be modified or destroyed. Non-system states can be modified and destroyed. Via the API you can only create non-system states, although support for adding system states may be added in the future. A state can be bound to a status (such as the case of HARDWARE_PROBLEM), or can be used with any status (such as the case of RUNNING). The out of the box available states are described below.

States

Status	State Label	State Name	State Description
Any	Failed	FAILED	A service in this state has encountered a problem and may not be operational. It cannot be started nor stopped.
Any	New	NEW	A service in this state is inactive. It does minimal work and consumes minimal resources.
Any	Running	RUNNING	A service in this state is operational.
Any	Starting	STARTING	A service in this state is transitioning to Running.
Any	Stopping	STOPPING	A service in this state is transitioning to Terminated.
Any	Terminated	TERMINATED	A service in this state has completed execution normally. It does minimal work and consumes minimal resources.
Maintenance	Hardware Problem	HARDWARE_PROBLEM	An asset is experiencing a non-IPMI issue and needs to be examined. It needs investigation.
Maintenance	Hardware Testing	HW_TESTING	Performing some testing that requires putting the asset into a maintenance state.
Maintenance	Hardware Upgrade	HARDWARE_UPGRADE	An asset is in need or in process of having hardware upgraded.
Maintenance	IPMI Problem	IPMI_PROBLEM	An asset is experiencing IPMI issues and needs to be examined. It needs investigation.
Maintenance	Maintenance NOOP	MAINT_NOOP	Doing nothing, bouncing this through maintenance for my own selfish reasons.
Maintenance	Network Problem	NETWORK_PROBLEM	An asset is experiencing a network problem that may or may not be hardware related. It needs investigation.
Maintenance	Relocation	RELOCATION	An asset is being physically relocated.

More information about the states available in your collins instance can be found in the collins help.
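The binding between states and statuses in the table above can be sketched as a simple lookup. This is an illustration only: the mapping is copied from the table, while the `ANY` marker and the `state_allowed` helper are hypothetical names, not part of the Collins API.

```python
ANY = None  # marker: the state may be used with any status

# State name -> status it is bound to, taken from the table above.
STATE_BINDINGS = {
    "FAILED": ANY,
    "NEW": ANY,
    "RUNNING": ANY,
    "STARTING": ANY,
    "STOPPING": ANY,
    "TERMINATED": ANY,
    "HARDWARE_PROBLEM": "Maintenance",
    "HW_TESTING": "Maintenance",
    "HARDWARE_UPGRADE": "Maintenance",
    "IPMI_PROBLEM": "Maintenance",
    "MAINT_NOOP": "Maintenance",
    "NETWORK_PROBLEM": "Maintenance",
    "RELOCATION": "Maintenance",
}


def state_allowed(state, status):
    """Return True if the state may be used with the given status."""
    if state not in STATE_BINDINGS:
        raise KeyError(f"unknown state: {state}")
    bound = STATE_BINDINGS[state]
    return bound is ANY or bound == status
```

For example, `state_allowed("HARDWARE_PROBLEM", "Allocated")` is false, while `state_allowed("RUNNING", "Allocated")` is true.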

An audit trail with an API

Every modification or lifecycle event that occurs with an asset is logged, along with who made the change and the time of the change. If a tag is modified (and not encrypted), the previous and new value are both stored. Logs can be searched via the API and can be viewed on the web as well. Logs are immutable but can be created via the API. Below is a list of log levels (based on syslog) and descriptions.

Level	Description
EMERGENCY	A "panic" condition - notify all tech staff on call? (earthquake? tornado?) - affects multiple apps/servers/sites...
ALERT	Should be corrected immediately - notify staff who can fix the problem - example is loss of backup ISP connection
CRITICAL	Should be corrected immediately, but indicates failure in a primary system - fix CRITICAL problems before ALERT - example is loss of primary ISP connection
ERROR	Non-urgent failures - these should be relayed to developers or admins; each item must be resolved within a given time
WARNING	Warning messages - not an error, but indication that an error will occur if action is not taken, e.g. file system 85% full - each item must be resolved within a given time
NOTICE	Events that are unusual but not error conditions - might be summarized in an email to developers or admins to spot potential problems - no immediate action required
INFORMATIONAL	Normal operational messages - may be harvested for reporting, measuring throughput, etc - no action required
DEBUG	Info useful to developers for debugging the application, not useful during operations
NOTE	Created by users via the web UI, can be any kind of message
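Since the levels are based on syslog, one way to work with them is to map each to its standard syslog severity number (0 is most severe) and filter on that. The sketch below assumes log entries are plain dictionaries with a `level` key; that shape, and the use of these numbers by Collins itself, are assumptions. NOTE has no syslog equivalent, so it is skipped.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Standard syslog severities; a lower number is more severe."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


def at_least(entries, threshold):
    """Return entries at least as severe as threshold.

    Entries whose level is not a syslog severity (such as NOTE) are skipped.
    """
    return [
        entry for entry in entries
        if entry["level"] in Severity.__members__
        and Severity[entry["level"]] <= threshold
    ]


entries = [
    {"level": "DEBUG", "message": "cache warmed"},
    {"level": "ERROR", "message": "intake failed"},
    {"level": "NOTE", "message": "swapped a disk"},
    {"level": "ALERT", "message": "backup ISP down"},
]
urgent = at_least(entries, Severity.ERROR)
```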

System logs (messages that aren't specific to any particular kind of asset) can only be logged internally by collins. Collins of course uses an asset to log these kinds of messages against. By default the system asset is the multicollins.thisInstance value, or tumblrtag1. You can specify the system asset via the features.syslogAsset configuration.

IPAM for engineers, API included

Collins has an IP Address Management (IPAM) system built into it. The IPAM system is used for allocating both IPMI addresses and typical addresses. Addresses are configured in pools (which typically correspond to a VLAN), but can also be configured to be pool-less in the case where you don't manage your own IP Address space.

Collins will prevent duplicate IP address allocation, and will almost always use the smallest available address in a range. It is possible to allocate an address against any kind of asset. This is sometimes useful for instance when managing a VIP (virtual or floating IP address). You can create a configuration asset that holds the VIP for a service, then link that asset to others that will actually share the address.
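The allocation behaviour described above can be sketched with Python's standard `ipaddress` module. This is a simplified model, not the Collins allocator: the `allocate` function and the `used` set are hypothetical, and where Collins "almost always" picks the smallest free address, this sketch always does.

```python
import ipaddress


def allocate(network_cidr, used):
    """Return the smallest unused host address in the range and mark it used.

    used is a set of ipaddress objects already allocated; tracking it is
    what prevents the same address from being handed out twice.
    """
    network = ipaddress.ip_network(network_cidr)
    for host in network.hosts():  # ascending order
        if host not in used:
            used.add(host)
            return host
    raise RuntimeError(f"no free addresses left in {network_cidr}")


used = set()
first = allocate("10.0.0.0/29", used)
second = allocate("10.0.0.0/29", used)
```

If an address is later released (removed from `used`), the next call hands it out again because it is once more the smallest free address.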

In addition to address allocation, collins provides the ability to do other things you would expect from a typical IPAM system such as querying used addresses, understanding what an IP space looks like, finding assets in a pool or by address, etc.

At Tumblr we combine the IPAM functionality of Collins with the per asset LLDP data to automatically manage switch provisioning. We also use this data for generating kickstart files with the correct address information.

The fundamental idea with Collins IPAM is that of a pool. A pool is a named group of addresses. A pool definition will specify the network address range (specified in CIDR notation), an optional start address (e.g. the IP to start allocating from in the specified range), an optional gateway (if it's not the one you would infer from the CIDR range), and a name. Once a pool is configured it is possible to allocate addresses in that pool. If you don't manage your own address space, no worries. You can operate in a 'pool-less' mode where you can specify any address.
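A pool definition as described above might be modelled like this. It is a sketch, not the Collins configuration format: the `Pool` class and its field names are hypothetical, and treating the first usable host address as the inferred gateway is an assumption made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    """Illustrative pool definition; field names are not Collins config keys."""
    name: str
    network: str                          # CIDR notation, e.g. "10.0.0.0/29"
    start_address: Optional[str] = None   # where allocation begins
    gateway: Optional[str] = None         # explicit override

    def effective_gateway(self):
        if self.gateway is not None:
            return ipaddress.ip_address(self.gateway)
        # Assumption: the inferred gateway is the first usable host address.
        return next(iter(ipaddress.ip_network(self.network).hosts()))

    def candidates(self):
        """Host addresses eligible for allocation, in ascending order."""
        hosts = ipaddress.ip_network(self.network).hosts()
        if self.start_address is None:
            return list(hosts)
        start = ipaddress.ip_address(self.start_address)
        return [host for host in hosts if host >= start]


prod = Pool(name="PROD", network="10.0.0.0/29", start_address="10.0.0.3")
```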

There is more information available in the API section as well as in the configuration section.
