SQL and Investment Performance Measurement: Setting Up

Part I: Setting Up a Database

This blog is about how to process investment performance using SQL and a database. For professional use, there are a number of systems available for calculating the numbers, but even then understanding the mechanics can be interesting and useful. Also, having total control over the process may enable choices that are not available in the system used, for example on how to handle derivatives, how to set up the portfolio structure, or how and which risk or other external data to use when calculating risk-based statistics.

It is obviously possible to just download the data, as is, from the database to Excel, a Python application, or similar, and perform the calculations there. While that has some advantages, it is often more efficient to process the large amount of underlying data in the database server and only bring the required results over to the client application.

At the time of writing, I intend to do the following, in pure SQL:
  1. Set up a database with sample data for a portfolio and a benchmark.
  2. Calculate daily performance time series using a portfolio structure and the sample data.
  3. Calculate daily Brinson-Fachler performance attribution, with Cariño scaling, producing cumulative attribution effect time series.
  4. Add an Excel front end to look at the results (well, this is not SQL...).
All the code can be found in my GitHub repository.

A Database is Easy to Set Up

First, we need a database. A relational SQL database, that is. I am using Microsoft SQL Server, and while there is an ISO/IEC/ANSI standard for SQL, other SQL databases like Oracle, PostgreSQL, MariaDB, etc. all have their differences when it comes to details, so you might need to tweak the code to work with your setup.

If you do not have a database already, Microsoft conveniently offers the Express Edition of MS SQL Server, a free edition that can be used for development and learning purposes. It is also relatively simple to set up and get running. So, no need for a professional DBA just yet!

Besides a database, we need some sort of user interface to work with it. If you just installed MS SQL Server, also install Microsoft SQL Server Management Studio (SSMS). Once you get that running, connect to the database you just created, and you are ready to go.

Start with a Schema

First, let's set up a schema for the project. Schemas are important in providing structure to the database; in this instance, we can use a schema to separate this project from other database objects. Schemas are also very useful in handling database access privileges, providing a way to grant or revoke user access to the whole schema and all its objects. All that's needed is

create schema PerfTest authorization dbo;

At this stage, we do not worry about giving other users access privileges to our new schema (called PerfTest). If the above statement returned an error message, you do not have sufficient privileges yourself, and need to get them before moving forward.
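When the time does come to grant access, a single statement can cover the whole schema and all its objects. A minimal sketch, with a hypothetical user SomeUser:

grant select, insert, update, delete on schema::PerfTest to SomeUser;   -- SomeUser is hypothetical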

Tables are Served

Data must be stored in tables. A normal* table in a relational database is a collection of rows of data, each row consisting of a number of columns. Each column has a data type, e.g. date, integer, or varchar(50). Usually, a table also has a primary key defining which columns together form a unique identifier for each row, and that primary key most often also serves as the clustered index of the table, meaning that the data rows are actually stored on disk in that order. (It is very rarely a good idea not to have a primary key for a table.) In addition, a table can have foreign keys, constraints, and indexes. I will not go into triggers for the moment; statistics are mostly handled automatically and help queries find efficient ways of getting the needed data from the table.

A foreign key is a reference to another table. It is a way of making sure that, for example, the Transact table only contains buy and sell transactions for portfolios that actually exist in the Portfolio table.

A constraint simply sets limits on the values that will be accepted for a column.
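As a sketch, using the Price table created later in this post, a check constraint could reject non-positive prices (this is illustrative only and not part of the GitHub script):

alter table PerfTest.Price
	add constraint CHK_Price#PositivePrice check (Price > 0);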

(* In SQL Server, there are also some somewhat different table constructs, like the columnstore and in-memory variants.)

Indexing is Important

Indexes are very important, especially as tables grow larger. Think of an address book table with a million names and addresses, having columns LastName, FirstName, StreetAddress, and City. If the primary key of that table were (LastName, FirstName), the data would be stored in that order (and each person stored would be required to have a unique name!). For the database, it would be easy to find everyone with LastName "Hunter", say 100 people and rows, as the table rows would be in that alphabetical order. The database would use table statistics to go to roughly the right part of the table, and read only a very small portion of the table rows - perhaps slightly more than a hundred, but not very much more. However, finding everyone with FirstName "John" would require reading every single row of the table, and only selecting the Johns for the output.

If there were a frequent need to find people based on their first names, an index on FirstName would store all first names in alphabetical order, making the Johns as easy to find as the Hunters. One major difference between the two would still exist: if you also wanted the actual addresses of the Johns and the Hunters, the database would have to do a lot more work for the Johns. Why? Because the primary key, being the clustered index, also represents the actual table data and, hence, contains all the columns. The index on the FirstName column contains only FirstName and the remaining primary key columns (in this case, LastName). In effect, when using the index on FirstName, the database would find the combinations of FirstName and LastName where FirstName equals "John", and then use those to go and find the rows in the primary key, with the required address information. This operation is called a key lookup, and it is often quite costly (i.e. slow) to perform.

To avoid key lookups, we can include columns to be stored with the index. If we included StreetAddress and City in the index, just finding the Johns in it would suffice; no key lookups would be necessary. The downside - perhaps obvious - is that every index and all its columns have to be stored. By creating an index on FirstName and including all table columns in it, we add the capability of quickly finding people and their addresses based on their first names, but in effect we are now storing the table twice. And what about searches on City, or StreetAddress? Adding indexes trades disk space for speed, in a way. Another downside is that every time table data are inserted, updated, or deleted, the indexes need to be updated as well, making those operations take longer.
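A minimal sketch of the two variants, on a hypothetical AddressBook table matching the example above:

create table dbo.AddressBook (
	LastName varchar(50) not null,
	FirstName varchar(50) not null,
	StreetAddress varchar(100) not null,
	City varchar(50) not null,
	constraint PK_AddressBook primary key (LastName asc, FirstName asc)
);
-- finds the Johns quickly, but fetching their addresses still needs key lookups
create index AddressBook_IX#FirstName on dbo.AddressBook (FirstName asc);
-- covering variant: the included columns let the index answer the whole query
create index AddressBook_IX#FirstName#Addresses on dbo.AddressBook (FirstName asc) include (StreetAddress, City);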

An index can also be defined unique, in which case it will not allow any duplicate values (rows where all index columns have equal values), or filtered, covering only a selected part of the table rows.
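Continuing the hypothetical AddressBook sketch, a filtered index might cover only one city:

-- covers only the rows matching the predicate; smaller and cheaper to maintain
create index AddressBook_IX#LastName#London on dbo.AddressBook (LastName asc) where (City = 'London');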

The above also emphasizes the importance of the primary key. The columns of the primary key are usually defined by the business logic of the table - it is often rather obvious which columns form a unique key. However, their order within the primary key should be such that it a) supports queries and b) supports maintaining the table data. If you have a table you add data to daily, like the Price table below, having PriceDate as the first column of the primary key means all new data is added to the end of the table. If, on the other hand, InstrumentIK were the first column of the primary key, each day you would be adding data all over the table. Both may have their advantages when querying, but the latter has a clear disadvantage: always adding data in between existing rows causes table fragmentation.

Tables for Performance Measurement

I have set up seven tables to hold all the data needed for this project. To keep things a little simpler, I have used only partial database normalization to limit the number of tables. The tables are:
  • Calendar
  • FXRate
  • Instrument
  • Portfolio
  • Price
  • Structure
  • Transact
Each will be described below. In GitHub, there's a complete script for setting these up and populating them with test data. Of course, you could use your own data instead.

Calendar

A calendar table is often very useful when working with dates, and especially with time series of data. It could include a number of different calendars, but I have limited this table to just one, containing the business days I am interested in. I added previous and next business day columns, as these help with many date operations.

create table PerfTest.Calendar (
	ToDate date not null,
	PrevToDate date null,
	NextToDate date null,
	constraint PK_Calendar primary key (
		ToDate asc
	)
);

As you can see, ToDate is the only column in the primary key, and hence the table only allows unique values of ToDate.
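Once the business days themselves have been inserted, PrevToDate and NextToDate can be derived from them. A minimal sketch using window functions (the GitHub script may do this differently):

with d as (
	select ToDate, PrevToDate, NextToDate,
		lag(ToDate) over (order by ToDate) as CalcPrev,
		lead(ToDate) over (order by ToDate) as CalcNext
	from PerfTest.Calendar
)
update d
set PrevToDate = CalcPrev,
	NextToDate = CalcNext;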

FXRate

Foreign exchange rates are needed solely for converting foreign currency values to the portfolio currency. I have defined the table so that there can be many base currencies; in the script, though, I have only used EUR as the base currency.

create table PerfTest.FXRate (
	PriceDate date not null,
	Currency char(3) not null,        -- Currency is the base currency. E.g. EUR for EUR investors.
	PriceCurrency char(3) not null,   -- PriceCurrency is the currency of the price, market value etc. to be converted to Currency.
	FXRate numeric(19, 8) not null,   -- FXRate is expressed so that e.g. Price [in PriceCurrency] / FXRate = Price [in Currency]. 
	constraint PK_FXRate primary key (
		PriceDate asc,
		Currency asc,
		PriceCurrency asc
	)
);
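As a usage sketch, this is how prices (from the Price table defined below) could be converted to EUR. The date is just an example, and the query assumes same-currency rows (EUR to EUR, rate 1) exist in FXRate:

select p.InstrumentIK,
	p.Price / fx.FXRate as PriceInEUR   -- Price [in PriceCurrency] / FXRate = Price [in EUR]
from PerfTest.Price as p
	join PerfTest.FXRate as fx
		on fx.PriceDate = p.PriceDate
		and fx.Currency = 'EUR'
		and fx.PriceCurrency = p.PriceCurrency
where p.PriceDate = '2024-01-31';   -- hypothetical date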

Instrument

The Instrument table contains the details of the instruments, or securities, including cash accounts. In this testing environment there is a very limited number of instrument properties, like AssetClass, Currency, and Sector. If the data were fully normalized, each of these (except the name columns) would have its own table, and this table would only contain a reference to it. Or, as properties like Sector may change over time, there could also be separate tables just for linking the instruments to these properties, with date column(s) defining the period when a property was active.

create table PerfTest.Instrument (
	InstrumentIK int not null,
	ShortName varchar(16) not null,
	LongName varchar(100) not null,
	Currency char(3) not null,
	Country char(2) not null,
	Region varchar(50) not null,
	AssetClass varchar(50) not null,
	Sector varchar(50) not null,
	constraint PK_Instrument primary key (
		InstrumentIK asc
	)
);
create index Instrument_IX#ShortName on PerfTest.Instrument (ShortName asc);
create index Instrument_IX#LongName on PerfTest.Instrument (LongName asc);

I have also added two indexes to the table, enabling a fast search of an instrument by its name.

Portfolio

Portfolios could have a lot of properties, but I have added only name and currency. The unique indexes enable fast searching, but also enforce a data validation rule that prevents entering two portfolios with the same name into the table. In this test environment, benchmarks are also portfolios.

create table PerfTest.Portfolio (
	PortfolioIK int not null,
	ShortName varchar(16) not null,
	LongName varchar(100) not null,
	Currency char(3) not null,
	constraint PK_Portfolio primary key (
		PortfolioIK asc
	)
);
-- unique indexes to not allow duplicate names
create unique index Portfolio_UQ#ShortName on PerfTest.Portfolio (ShortName asc);
create unique index Portfolio_UQ#LongName on PerfTest.Portfolio (LongName asc);

Price

The instrument price table could hold a lot more data, like different kinds of prices (bid, ask, valuations at different times of day, etc.), but as can be seen from its primary key, I have limited the data to one price per day for each instrument. The first column of the primary key is PriceDate, so new data will mostly be added to the end of the table, helping limit its fragmentation.

create table PerfTest.Price (
	PriceDate date not null,
	InstrumentIK int not null,
	PriceCurrency char(3) not null,
	Price numeric(19, 6) not null,
	constraint PK_Price primary key (
		-- only one Price per PriceDate accepted.
		PriceDate asc,
		InstrumentIK asc
	),
	constraint FK_Price_Instrument#InstrumentIK foreign key (InstrumentIK) references PerfTest.Instrument (InstrumentIK)
);
-- index to enable quick use of instrument price time series
create unique index Price_UQ#InstrumentIK#PriceDate on PerfTest.Price (InstrumentIK asc, PriceDate asc) include (PriceCurrency, Price);

The primary key also enables getting cross-sectional price data efficiently, e.g. when looking at portfolio holdings on a specific day. There is, however, also an index on (InstrumentIK, PriceDate), which makes it quick to access the price time series of any instrument, especially as the PriceCurrency and Price columns are included in the index. While the table itself will not suffer that much from fragmentation, this index will, as data are frequently added between existing rows. It is a good idea to have automated database maintenance routines that look for and rebuild fragmented tables and indexes.
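For illustration, here is the kind of time series query this index serves, and the rebuild statement such a maintenance routine would issue (the instrument key is hypothetical):

select PriceDate, PriceCurrency, Price
from PerfTest.Price
where InstrumentIK = 1   -- hypothetical instrument
order by PriceDate;

alter index Price_UQ#InstrumentIK#PriceDate on PerfTest.Price rebuild;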

Also note the foreign key constraint. It basically says that any InstrumentIK in this table must be found in the Instrument table as well. Trying to enter price data for an InstrumentIK not found in the Instrument table, or to delete an instrument from the Instrument table without first deleting its prices, would cause an error. Instruments must be created before adding prices, and prices deleted before deleting instruments.

Structure

To analyze portfolio performance, it is important to follow the decision-making structure. In a traditional framework, perhaps best applicable to asset allocation and equity portfolios, each step in decision making is independent of the others. For example, an equity portfolio manager might first decide the regional allocation of the portfolio. Next, within each region, she would make decisions on sector allocation; and finally, within each sector in each region, select the stocks and their weights. To this process we can easily apply the Brinson-Fachler performance attribution methodology. (In contrast, if the decisions made do not follow a clear tree-type structure with independent decisions, this kind of analysis may be prone to errors. Fixed income attribution typically needs a different approach.)

To define the decision-making structure, I have set up a simple table defining the decision-making levels. This is, again, a simplified view; it would add flexibility if the structure allowed different decision-making levels for each branch of the tree. An example would be to start with asset class allocation as the first level, but then have different second levels for each asset class, like sector for equities but currency for bonds. This kind of a recursive tree structure would be rather easy to define, but somewhat more complex to use (a sketch follows at the end of this section).

create table PerfTest.Structure (
	StructureIK smallint not null,
	ShortName varchar(16) not null,
	LongName varchar(100) not null,
	Level1 varchar(20) not null,
	Level2 varchar(20) null,
	Level3 varchar(20) null,
	Level4 varchar(20) null,
	LevelCnt as case when Level4 is not null then 4 when Level3 is not null then 3 when Level2 is not null then 2 else 1 end persisted,
	constraint PK_Structure primary key (
		StructureIK asc
	),
	-- only these values accepted to define a structure.
	constraint CHK_Structure#Level1 check (Level1 in ('AssetClass','Currency','Country','Region','Sector','Instrument')),
	constraint CHK_Structure#Level2 check ((Level2 in ('AssetClass','Currency','Country','Region','Sector','Instrument') and Level1 <> 'Instrument') or (Level2 is null and Level3 is null and Level4 is null)),
	constraint CHK_Structure#Level3 check ((Level3 in ('AssetClass','Currency','Country','Region','Sector','Instrument') and Level2 <> 'Instrument') or (Level3 is null and Level4 is null)),
	constraint CHK_Structure#Level4 check ((Level4 in ('AssetClass','Currency','Country','Region','Sector','Instrument') and Level3 <> 'Instrument') or Level4 is null)
);
-- unique indexes to not allow duplicate names
create unique index Structure_UQ#ShortName on PerfTest.Structure (ShortName asc);
create unique index Structure_UQ#LongName on PerfTest.Structure (LongName asc);

The Structure table includes a computed column, LevelCnt. This is a column you cannot enter data into; its value is determined from other column(s) of the table. The definition here basically just checks how many levels each structure actually uses, so the result is always a number between 1 and 4. The computed column is marked persisted, which simply means the result is actually stored on disk; otherwise, it would be calculated every time it is used in a query (again a choice between disk space and speed).
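To illustrate (the row here is hypothetical, not from the sample data):

-- Level3 and Level4 are left null, so LevelCnt computes to 2
insert into PerfTest.Structure (StructureIK, ShortName, LongName, Level1, Level2)
values (99, 'AC-Sector', 'Asset class, then sector', 'AssetClass', 'Sector');
select ShortName, LevelCnt from PerfTest.Structure where StructureIK = 99;   -- returns LevelCnt = 2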

The four check constraints make sure the data only uses certain pre-defined values. Otherwise, someone might try to use something not supported (e.g. Duration) as a level. The constraints also make sure that Instrument, if used, is the final level, and that there are no missing (null) levels between used levels.
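For comparison, the recursive alternative mentioned earlier could look something like this. This is only a sketch, not used in this project, and all names are hypothetical:

create table PerfTest.StructureNode (
	NodeIK int not null,
	ParentNodeIK int null,          -- null for the root node of a structure
	GroupBy varchar(20) not null,   -- e.g. 'AssetClass' for one branch, 'Currency' for another
	constraint PK_StructureNode primary key (NodeIK asc),
	constraint FK_StructureNode_StructureNode#ParentNodeIK foreign key (ParentNodeIK) references PerfTest.StructureNode (NodeIK)
);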

Transact

Portfolio buy and sell trades are stored in the Transact table ("transaction" has a very specific meaning in SQL and is, for that reason, not used as a table name).

This is a very simplified data set for keeping trades - it basically stores the trade date, units, and value of each trade (plus PortfolioIK and InstrumentIK, obviously). In addition, the value can be given in any currency, and CashInstrumentIK gives the cash account of the trade.

create table PerfTest.Transact (
	TransactIK int not null,
	TradeDate date not null,
	PortfolioIK int not null,
	InstrumentIK int not null,
	CashInstrumentIK int null,
	Units numeric(29,6) null,
	Currency char(3) not null,
	CurrencyValue numeric(19,2) not null,
	Comment varchar(200) null,
	constraint PK_Transact primary key (
		TransactIK asc
	),
	-- foreign keys to ensure consistency with Portfolio and Instrument
	constraint FK_Transact_Portfolio#PortfolioIK foreign key (PortfolioIK) references PerfTest.Portfolio (PortfolioIK),
	constraint FK_Transact_Instrument#InstrumentIK foreign key (InstrumentIK) references PerfTest.Instrument (InstrumentIK),
	constraint FK_Transact_Instrument#CashInstrumentIK foreign key (CashInstrumentIK) references PerfTest.Instrument (InstrumentIK)
);
create index Transact_IX#TradeDate on PerfTest.Transact (TradeDate asc);
create index Transact_IX#PortfolioIK#TradeDate on PerfTest.Transact (PortfolioIK asc, TradeDate asc);
create index Transact_IX#InstrumentIK#TradeDate on PerfTest.Transact (InstrumentIK asc, TradeDate asc) include (PortfolioIK);
create index Transact_IX#CashInstrumentIK#TradeDate on PerfTest.Transact (CashInstrumentIK asc, TradeDate asc) include (PortfolioIK) where (CashInstrumentIK is not null);

There is no unique business key in this table - it is quite common to have two trades with exactly the same properties executed on the same day. Hence, a surrogate key (or technical key), TransactIK, is used. This can simply be a number sequence starting from 1 and adding 1 for each new table row. (SQL Server also has automated functionality for this, although it is not used here.)
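That automated functionality includes identity columns and sequence objects. A sketch of the sequence variant, with hypothetical values that assume the referenced portfolio and instrument already exist:

create sequence PerfTest.TransactIK_Seq as int start with 1 increment by 1;

insert into PerfTest.Transact (TransactIK, TradeDate, PortfolioIK, InstrumentIK, Currency, CurrencyValue)
values (next value for PerfTest.TransactIK_Seq, '2024-01-02', 1, 1, 'EUR', 1000.00);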

Again, the foreign keys make sure all portfolio and instrument references are valid, and a number of indexes are provided for faster access to the table data.

Data, please

With the tables now set up, all we need to do is populate them with data. There are many ways this could be done; I have provided simple SQL insert into ... values (...) statements with some sample data in GitHub.
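For illustration, the statements look like this (the values here are made up; the actual sample data is in the GitHub script):

insert into PerfTest.Portfolio (PortfolioIK, ShortName, LongName, Currency)
values (1, 'EQ-EUROPE', 'European Equity Portfolio', 'EUR');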

Next steps

Next, I will provide the SQL scripts for calculating daily portfolio holdings, with market values and cash flows, and time series of portfolio and benchmark returns using the structures. To make the structures easier to use, I will also build an indexed view for them. See Part II.
