Collins Concepts

Stuff to know before getting started

The collins data model

Collins was designed from the beginning to represent assets in the simplest way possible. This simplicity makes for an efficient data model and allows for a large degree of flexibility. Collins really only knows about a few different kinds of things. Collins has:

  • Assets
  • Status & States
  • Tags
  • Logs
  • Addresses

That's it. Everything you need to know about collins has to do with one of these types of things.

An asset in collins describes a thing, usually a piece of hardware or a configuration.

Assets themselves are very simple internally and only have a handful of meaningful attributes. An asset consists of a:

tag
A tag is a unique, immutable, alphanumeric sequence of characters associated with the asset. These are arbitrary but nearly all API calls for interacting with assets will reference this tag.
status
Indicates the current lifecycle phase of the asset. More details in the next section.
state
Indicates a phase (status) specific state of the asset. More details in the next section.
type
Current supported types include: Configuration, Data Center, Power Circuit, Power Strip, Rack, Router, Server Node and Switch. Of primary interest to most users will be server nodes and configurations.

New asset types are fairly easy to create but cannot be created via the API (although this is on the roadmap). Status values are not generally added or changed; typically a new state is used instead.

Multi-dimensional key/values for an asset

The most important concept in collins is that of a tag. A tag consists of a key (such as TOTAL_DISK_SPACE_BYTES), a value (such as 536870912000) and a dimension (such as 0). Tags are what provide most of the flexibility within collins. If you set a tag on an asset and that tag does not yet exist, it will be created and immediately available to query on. If you set a new tag value using the same dimension, the old value will be overwritten. All tag changes, including deletes, are logged as part of the asset audit trail.
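The overwrite and audit behaviour described above can be pictured with a small in-memory sketch. This is not how collins is implemented; the `Asset` class, its method names and the asset tag `example001` are all hypothetical and exist only to illustrate the key/value/dimension idea.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Toy in-memory asset (hypothetical; not the collins implementation)."""
    tag: str
    # (key, dimension) -> value
    values: dict = field(default_factory=dict)
    # Append-only record of every change, including deletes.
    audit: list = field(default_factory=list)

    def set_tag(self, key, value, dimension=0):
        slot = (key, dimension)
        old = self.values.get(slot)
        self.values[slot] = value  # same key and dimension: old value is overwritten
        self.audit.append(("set", key, dimension, old, value))

    def delete_tag(self, key, dimension=0):
        old = self.values.pop((key, dimension), None)
        self.audit.append(("delete", key, dimension, old, None))


asset = Asset(tag="example001")
asset.set_tag("TOTAL_DISK_SPACE_BYTES", 536870912000, dimension=0)
asset.set_tag("TOTAL_DISK_SPACE_BYTES", 0, dimension=1)              # new dimension, separate value
asset.set_tag("TOTAL_DISK_SPACE_BYTES", 1000204886016, dimension=0)  # overwrites dimension 0
```

After these three calls the asset holds two values for the same key (one per dimension), and the audit list holds three entries, the last of which records both the previous and the new value.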

Tag Types

Tags can be either managed, automated, or unmanaged.

Managed
Tags that are not user manageable (such as CPU speed) and can only be set or changed during certain lifecycle events, such as the server intake process. These tags cannot be changed, updated or deleted by a user. This is enforced in the software.
Automated
Tags that are managed by external automated processes such as server provisioning. Although these can technically be set/changed by users via the API, convention dictates that they should not be.
Unmanaged
Tags set by a user via some API interaction (most likely), but probably on a manual basis. In some cases these may be part of an automated process, but some user interaction will dictate the tag value. These are typically used for things like notes, links, testing, etc.

Dimensions

A dimension can be any integer value (including a timestamp) and is typically used for grouping things together. You can imagine the disks of a host being described like:

{
    "0": {
      "DEV_NAME": "/dev/sda",
      "TOTAL_DISK_SPACE_BYTES": 536870912000,
      "DISK_TYPE": "SATA"
    },
    "1": {
      "DEV_NAME": "/dev/sdb",
      "TOTAL_DISK_SPACE_BYTES": 0,
      "DISK_TYPE": "CD-ROM"
    }
}

This information tells you that the asset has two disks (one SATA, one CD-ROM) and tells you a bit about each physical disk. The value of the dimension in this case is just to logically group things together, and possibly to provide some ordering. You could, however, also use the dimension to represent things like port allocation (i.e. each dimension represents a port in use on the asset), or the order of some arbitrary event where the timestamp is the dimension used. It's up to you.
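As a rough illustration of the grouping described above, the following sketch turns a flat list of (key, value, dimension) triples into the nested structure shown in the example. The `group_by_dimension` helper is hypothetical and is not part of collins.

```python
import json
from collections import defaultdict

# Flat (key, value, dimension) triples for the two disks in the example above.
triples = [
    ("DEV_NAME", "/dev/sda", 0),
    ("TOTAL_DISK_SPACE_BYTES", 536870912000, 0),
    ("DISK_TYPE", "SATA", 0),
    ("DEV_NAME", "/dev/sdb", 1),
    ("TOTAL_DISK_SPACE_BYTES", 0, 1),
    ("DISK_TYPE", "CD-ROM", 1),
]


def group_by_dimension(triples):
    """Group tags that share a dimension into one dict per dimension."""
    grouped = defaultdict(dict)
    for key, value, dimension in triples:
        grouped[dimension][key] = value
    # Sorting the dimensions gives the optional ordering mentioned above.
    return {dimension: grouped[dimension] for dimension in sorted(grouped)}


disks = group_by_dimension(triples)
print(json.dumps(disks, indent=2))
```

Because dimensions are plain integers, the same helper would work unchanged if the dimension were a port number or a timestamp.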

Tags

The flexible nature of the tag approach used with Collins means that agreed-upon conventions are required.

Managed Tags

Most hardware information is gleaned from LSHW and most networking information is taken from LLDP. The usefulness of the hardware information is dependent on the version of LSHW (recent versions get better information). The LLDP information in SoftLayer is entirely faked by converting the output from `ifconfig` to an LLDP compatible format.

Managed tags automatically use an additional dimension if required. For example, say you have two memory sticks and we're tracking a description and size for each stick. Collins will create two 'groups', one that represents each memory stick. This is how collins differentiates between the same tag being used multiple times for the same asset.

General Information

All of these are specified as HTTP parameters during the intake process. These are not inferred, but specified.

Tag | Type | Description
--- | --- | ---
SERVICE_TAG | String | Vendor service tag (e.g. dell tag)
CHASSIS_TAG | String | Manufacturer serial number
RACK_POSITION | String | Physical position in the rack
POWER_PORT_A | String | The port in the CDU that the asset is attached to. The name and value will depend on your power configuration (detailed in configuration)

CPU Information

Collins assumes that each CPU is identical, so it uses only one dimension to represent all CPUs. Collins creates the visual hardware representation based on the CPU_COUNT. This information is taken from LSHW.

Tag | Type | Description
--- | --- | ---
CPU_COUNT | Integer | Physical CPU count in machine
CPU_CORES | Integer | Number of cores per physical CPU
CPU_THREADS | Integer | Number of threads per core
CPU_SPEED_GHZ | Float | Speed of CPU
CPU_DESCRIPTION | String | Vendor description or label
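Since the CPU tags are per-CPU and per-core counts, totals for the whole machine have to be derived. A minimal sketch, using made-up values for a two-socket host:

```python
# Hypothetical CPU tag values for a single asset (one dimension, since
# all CPUs are assumed identical).
cpu = {
    "CPU_COUNT": 2,     # physical CPUs
    "CPU_CORES": 8,     # cores per physical CPU
    "CPU_THREADS": 2,   # threads per core
    "CPU_SPEED_GHZ": 2.4,
}

total_cores = cpu["CPU_COUNT"] * cpu["CPU_CORES"]
total_threads = total_cores * cpu["CPU_THREADS"]
print(total_cores, total_threads)
```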

Memory Information

Collins represents a stick of RAM using the dimension (dictates bank used), MEMORY_SIZE_BYTES and MEMORY_DESCRIPTION. Each dimension represents a stick of RAM present in an asset. Collins creates the visual representation of the RAM layout by using the MEMORY_BANKS_TOTAL and filling in the sticks that are missing. No additional data is stored on missing RAM. This information is taken from LSHW.

Tag | Type | Description
--- | --- | ---
MEMORY_SIZE_BYTES | Long | Size of single RAM module
MEMORY_DESCRIPTION | String | Vendor description or label
MEMORY_SIZE_TOTAL | Long | Total number of bytes (MEMORY_SIZE_BYTES * number of modules)
MEMORY_BANKS_TOTAL | Integer | Total number of RAM banks
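The way a RAM layout can be rebuilt from these tags can be sketched as follows. This assumes, for illustration only, that banks are numbered from 0 and that the dimension of each MEMORY_SIZE_BYTES tag is the bank index; the `bank_layout` helper is hypothetical.

```python
GIB = 1024 ** 3


def bank_layout(banks_total, sticks):
    """Return one entry per bank: the stick size in bytes, or None if empty.

    `sticks` maps dimension (assumed to be the bank index) to MEMORY_SIZE_BYTES.
    Nothing is stored for empty banks; they are filled in here.
    """
    return [sticks.get(bank) for bank in range(banks_total)]


# Two 8 GiB sticks in banks 0 and 2 of a four-bank board.
sticks = {0: 8 * GIB, 2: 8 * GIB}
layout = bank_layout(4, sticks)

# With identical modules this equals MEMORY_SIZE_BYTES * number of modules.
memory_size_total = sum(size for size in layout if size is not None)
print(layout, memory_size_total)
```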

Network Information

Collins represents a NIC with its NIC_SPEED, MAC_ADDRESS and NIC_DESCRIPTION. Each dimension represents a different physical network card. This information is taken from LSHW.

Tag | Type | Description
--- | --- | ---
NIC_SPEED | Long | Speed in bits per second
MAC_ADDRESS | String | MAC address of network interface
NIC_DESCRIPTION | String | Vendor description or label

Disk Information

Collins represents a disk with its DISK_SIZE_BYTES, DISK_TYPE and DISK_DESCRIPTION. Each dimension represents a different physical disk. This information is taken from LSHW.

Tag | Type | Description
--- | --- | ---
DISK_SIZE_BYTES | Long | Size in bytes of disk. This will occasionally report as 0 when the disk is part of a RAID group, or in the case of a CD-ROM. The number of physical disks and storage total will still be accurate in this case.
DISK_TYPE | Enum | Type of disk. IDE, SCSI, PCIe, or CD-ROM.
DISK_DESCRIPTION | String | Vendor description or label
DISK_STORAGE_TOTAL | String | Sum of size for all disks. Because this number is often larger than a Long, it is stored as a String

LLDP Information

Collins represents each interface connected to an LLDP capable device using all of the information below. Each dimension represents a different physically connected network interface. Each interface may have zero or more VLANs. This information is taken from LLDP.

  • LLDP_INTERFACE_NAME
  • LLDP_CHASSIS_NAME
  • LLDP_CHASSIS_ID_TYPE
  • LLDP_CHASSIS_ID_VALUE
  • LLDP_CHASSIS_DESCRIPTION
  • LLDP_PORT_ID_TYPE
  • LLDP_PORT_ID_VALUE
  • LLDP_PORT_DESCRIPTION
  • LLDP_VLAN_ID
  • LLDP_VLAN_NAME

Automated Tags

This data should be populated by largely automated processes. The list below is going to be fairly Tumblr specific with some exceptions. I have tried to denote tags that are likely Tumblr specific. If you use Collins as a fact/node terminus with puppet (or whatever system you use), this data can be useful for driving automation. I have tried to denote what Tumblr processes populate this data, but this is likely not very meaningful outside of Tumblr.

General Automated Tags

Automated tags that are not Tumblr specific, although some of the tools that populate them may be.

Tag | Type | Populated By | Description
--- | --- | --- | ---
HOSTNAME | String | visioner | Populated when an asset successfully starts the provisioning process
NODECLASS | String | visioner | Populated when an asset successfully starts the provisioning process. This is used by the puppet master to resolve the nodeclass for a host.
CONTACT | String | collins | When a user provisions an asset the specified CONTACT is populated. This is used for notification of maintenance, provisioning being done, etc.
PRIMARY_ROLE | String | collins | When a user provisions an asset the chosen PRIMARY_ROLE is populated. This is used to communicate the intended purpose of the asset (CACHE, DATABASE, HADOOP, PROXY, WEB, etc)
POOL | String | collins | Populated along with PRIMARY_ROLE, this tag is used to communicate the functional pool the asset should participate in, e.g. MEMCACHE_POST_POOL
SECONDARY_ROLE | String | collins | Populated along with PRIMARY_ROLE, this tag is used to communicate some secondary role such as MASTER, SLAVE, or a deployment group (such as WEB_A, WEB_B).
SYSTEM_USERNAME | String | visioner | Set by visioner when a host is provisioned. In SoftLayer, reconciler populates this data as we pull it from the SL API. Typically 'root'.
SYSTEM_PASSWORD | String | visioner | Set by visioner when a host is provisioned. In SoftLayer, reconciler populates this data as we pull it from the SL API. Typically 'root'.
DATACENTER_NAME | Enum | Invisible Touch | Set by Invisible Touch when a host goes through intake. Largely unused since we use multicollins deployments.
SYSTEM_COST | Integer | reconciler | Populated by reconciler in SoftLayer, populated by hand in EWR
CANCEL_TICKET | Long | collins | If the SoftLayer plugin is configured and an asset is cancelled, this value is populated

Tumblr Automated Tags

Tumblr specific automated tags

Tag | Type | Populated By | Description
--- | --- | --- | ---
POWER_STATUS | String | decommissioner | Tracks decommission state as an asset is wiped before being returned
{APP}_SHA | String | deploytool | The deploytool updates the SHA on the asset when an application (tumblr, config, etc) is deployed to it. Only assets with the current SHA take production traffic.

Status and state describe the current phase of the lifecycle of an asset.

Status

The lifecycle (from birth to death) of an asset is described in terms of its status. The possible status values are fixed and cannot be managed via the API. While all available status values are listed below, the descriptions given are primarily indicative of their meanings for a server. A non-server asset type such as a configuration may only ever be allocated or decommissioned, for instance. The status values described below also give some specific insight into the Tumblr intake process for hardware.

Incomplete
Host not yet ready for use. It has been powered on and entered in Collins but burn-in is likely being run
New
Host has completed the burn-in process and is waiting for an onsite tech to complete physical intake
Unallocated
Host has completed intake process and is ready for use
Provisioning
Host has started provisioning process but has not yet completed it
Provisioned
Host has finished provisioning and is awaiting final automated verification
Allocated
This asset is in what should likely be considered a production state
Cancelled
Asset is no longer needed and is awaiting decommissioning
Decommissioned
Asset has completed the outtake process and can no longer be managed
Maintenance
Asset is undergoing some kind of maintenance and should not be considered for production use

The status transition should not generally happen by hand. Automated processes should drive status changes, not people. In fact, the collins UI only allows you to change the status by taking an action (e.g. cancelling an asset or putting it into maintenance).

State

While the status of an asset describes where it is in a discrete lifecycle, the state describes a lifecycle specific to a status. For example, a server that is in maintenance may have a state of HARDWARE_PROBLEM or HARDWARE_UPGRADE. Also note that those states are not appropriate for healthy (non-maintenance) assets, and so these states are restricted to assets with a status of Maintenance.

A state can be either a system state, or a non-system state. System states can not be modified or destroyed. Non-system states can be modified and destroyed. Via the API you can only create non-system states, although support for adding system states may be added in the future. A state can be bound to a status (such as the case of HARDWARE_PROBLEM), or can be used with any status (such as the case of RUNNING). The out of the box available states are described below.

Status | State Label | State Name | State Description
--- | --- | --- | ---
Any | Failed | FAILED | A service in this state has encountered a problem and may not be operational. It cannot be started nor stopped.
Any | New | NEW | A service in this state is inactive. It does minimal work and consumes minimal resources.
Any | Running | RUNNING | A service in this state is operational.
Any | Starting | STARTING | A service in this state is transitioning to Running.
Any | Stopping | STOPPING | A service in this state is transitioning to Terminated.
Any | Terminated | TERMINATED | A service in this state has completed execution normally. It does minimal work and consumes minimal resources.
Maintenance | Hardware Problem | HARDWARE_PROBLEM | An asset is experiencing a non-IPMI issue and needs to be examined. It needs investigation.
Maintenance | Hardware Testing | HW_TESTING | Performing some testing that requires putting the asset into a maintenance state.
Maintenance | Hardware Upgrade | HARDWARE_UPGRADE | An asset is in need or in process of having hardware upgraded.
Maintenance | IPMI Problem | IPMI_PROBLEM | An asset is experiencing IPMI issues and needs to be examined. It needs investigation.
Maintenance | Maintenance NOOP | MAINT_NOOP | Doing nothing, bouncing this through maintenance for my own selfish reasons.
Maintenance | Network Problem | NETWORK_PROBLEM | An asset is experiencing a network problem that may or may not be hardware related. It needs investigation.
Maintenance | Relocation | RELOCATION | An asset is being physically relocated.
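The binding between states and statuses listed above can be expressed as a simple lookup. The sketch below only mirrors the out of the box states; the `state_allowed` function and the `ANY` marker are hypothetical and do not correspond to any collins API.

```python
ANY = None  # marker for states that may be used with any status

# State name -> status it is bound to, taken from the list of default states.
STATE_BINDINGS = {
    "FAILED": ANY,
    "NEW": ANY,
    "RUNNING": ANY,
    "STARTING": ANY,
    "STOPPING": ANY,
    "TERMINATED": ANY,
    "HARDWARE_PROBLEM": "Maintenance",
    "HW_TESTING": "Maintenance",
    "HARDWARE_UPGRADE": "Maintenance",
    "IPMI_PROBLEM": "Maintenance",
    "MAINT_NOOP": "Maintenance",
    "NETWORK_PROBLEM": "Maintenance",
    "RELOCATION": "Maintenance",
}


def state_allowed(status, state):
    """Return True if `state` may be used on an asset with `status`."""
    if state not in STATE_BINDINGS:
        return False  # unknown state
    bound_status = STATE_BINDINGS[state]
    return bound_status is ANY or bound_status == status


print(state_allowed("Maintenance", "HARDWARE_PROBLEM"))  # bound to Maintenance
print(state_allowed("Allocated", "HARDWARE_PROBLEM"))    # rejected for a healthy asset
print(state_allowed("Allocated", "RUNNING"))             # usable with any status
```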

More information about the states available in your collins instance can be found in the collins help.

An audit trail with an API

Every modification or lifecycle event that occurs with an asset is logged, along with who made the change and the time of the change. If a tag is modified (and not encrypted), the previous and new value are both stored. Logs can be searched via the API and can be viewed on the web as well. Logs are immutable but can be created via the API. Below is a list of log levels (based on syslog) and descriptions.

Level | Description
--- | ---
EMERGENCY | A "panic" condition - notify all tech staff on call? (earthquake? tornado?) - affects multiple apps/servers/sites...
ALERT | Should be corrected immediately - notify staff who can fix the problem - example is loss of backup ISP connection
CRITICAL | Should be corrected immediately, but indicates failure in a primary system - fix CRITICAL problems before ALERT - example is loss of primary ISP connection
ERROR | Non-urgent failures - these should be relayed to developers or admins; each item must be resolved within a given time
WARNING | Warning messages - not an error, but indication that an error will occur if action is not taken, e.g. file system 85% full - each item must be resolved within a given time
NOTICE | Events that are unusual but not error conditions - might be summarized in an email to developers or admins to spot potential problems - no immediate action required
INFORMATIONAL | Normal operational messages - may be harvested for reporting, measuring throughput, etc - no action required
DEBUG | Info useful to developers for debugging the application, not useful during operations
NOTE | Created by users via the web UI, can be any kind of message
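If you consume these logs in your own tooling, it can help to give the levels an ordering. The sketch below uses the standard syslog severity numbers (0 to 7) for the first eight levels; how collins represents levels internally is not described here, and the value given to NOTE, the `at_or_above` helper and the sample messages are all placeholders.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    # Values 0-7 follow the standard syslog severities (lower is more severe).
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7
    # NOTE is not a syslog severity; 8 is an arbitrary placeholder value.
    NOTE = 8


def at_or_above(entries, threshold):
    """Keep (level, message) entries at least as severe as `threshold`.

    User-created NOTE entries carry no severity, so they are excluded.
    """
    return [
        (level, message)
        for level, message in entries
        if level is not LogLevel.NOTE and level <= threshold
    ]


# Made-up log entries for illustration.
entries = [
    (LogLevel.DEBUG, "hardware parse finished"),
    (LogLevel.WARNING, "file system 85% full"),
    (LogLevel.CRITICAL, "loss of primary ISP connection"),
    (LogLevel.NOTE, "free-form message from the web UI"),
]
urgent = at_or_above(entries, LogLevel.WARNING)
print([level.name for level, _ in urgent])
```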

System logs (messages that aren't specific to any particular kind of asset) can only be logged internally by collins. Collins of course uses an asset to log these kinds of messages against. By default the system asset is the multicollins.thisInstance value, or tumblrtag1. You can specify the system asset via the features.syslogAsset configuration.

IPAM for engineers, API included

Collins has an IP Address Management (IPAM) system built into it. The IPAM system is used for allocating both IPMI addresses and typical addresses. Addresses are configured in pools (which typically correspond to a VLAN), but can also be configured to be pool-less in the case where you don't manage your own IP Address space.

Collins will prevent duplicate IP address allocation, and will almost always use the smallest available address in a range. It is possible to allocate an address against any kind of asset. This is sometimes useful for instance when managing a VIP (virtual or floating IP address). You can create a configuration asset that holds the VIP for a service, then link that asset to others that will actually share the address.

In addition to address allocation, collins provides the ability to do other things you would expect from a typical IPAM system such as querying used addresses, understanding what an IP space looks like, finding assets in a pool or by address, etc.

At Tumblr we combine the IPAM functionality of Collins with the per asset LLDP data to automatically manage switch provisioning. We also use this data for generating kickstart files with the correct address information.

The fundamental idea with Collins IPAM is that of a pool. A pool is a named group of addresses. A pool definition will specify the network address range (specified in CIDR notation), an optional start address (i.e. the IP to start allocating from in the specified range), an optional gateway (if it's not the one you would infer from the CIDR range), and a name. Once a pool is configured it is possible to allocate addresses in that pool. If you don't manage your own address space, no worries. You can operate in a 'pool-less' mode where you can specify any address.
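To make the pool idea concrete, here is a minimal sketch of smallest-available allocation using Python's standard `ipaddress` module. It is not the collins allocator: the `Pool` class, the pool name `PROD`, the asset tags, and the assumptions that the inferred gateway is the first usable host and that the gateway is never handed out are all illustrative.

```python
import ipaddress


class Pool:
    """Toy address pool (hypothetical; not the collins implementation)."""

    def __init__(self, name, cidr, start_address=None, gateway=None):
        self.name = name
        self.network = ipaddress.ip_network(cidr)
        first_host = next(iter(self.network.hosts()))
        # Assumption: the inferred gateway is the first usable host address.
        self.gateway = ipaddress.ip_address(gateway) if gateway else first_host
        self.start = ipaddress.ip_address(start_address) if start_address else first_host
        self.allocated = {}  # address -> asset tag

    def allocate(self, asset_tag):
        """Hand out the smallest free address at or after the start address."""
        for address in self.network.hosts():
            if address < self.start or address == self.gateway:
                continue
            if address in self.allocated:
                continue  # prevents duplicate allocation
            self.allocated[address] = asset_tag
            return address
        raise RuntimeError(f"pool {self.name} is exhausted")

    def release(self, address):
        self.allocated.pop(ipaddress.ip_address(address), None)


pool = Pool("PROD", "10.0.0.0/24", start_address="10.0.0.10")
first = pool.allocate("web001")
second = pool.allocate("web002")
pool.release(first)
third = pool.allocate("web003")  # reuses the smallest free address
print(first, second, third)
```

The same `allocate` call works for any asset tag, which matches the VIP pattern described earlier: a configuration asset can hold an address just as a server can.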

There is more information available in the API section as well as in the configuration section.