Vectorizing Adjusted Close with Python

Aug 27, 2016

Adjusted prices are essential when working with historical stock prices. Any time there is a corporate split or dividend, all stock prices prior to that event need to be adjusted to reflect the change.

For one-off projects, it is easy enough to get the current adjusted close prices for any stock from Yahoo Finance or Quandl. It is convenient to let the provider handle all of the calculations of adjusting the prior prices. Using provided adjusted prices becomes impracticable when you are storing all prices for all symbols in a database. Not only would you have to store twice the number of values (the raw and adjusted price), but you would have to download the entire price history to update your adjusted prices. Alternatively, if you calculate your own adjusted prices, you only need to store the raw prices and download the latest prices for day to day database updates.

Calculating your own adjusted prices requires a few extra components in addition to the raw prices, including all historical corporate stock splits and cash dividends. To actually calculate the adjusted prices, you will need to have an efficient function to process the adjusted price formula.

Types of Corporate Actions

Adjusting historical prices after a stock split is required because the company is changing the number of shares being traded (shares outstanding). Since the market value (market value = shares outstanding x share price) remains the same in a stock split, the share price must changes by the inverse of the ratio that the shares outstanding change. Stock splits often occur when the share price is either very high or very low, partly due to the company trying to make the share price more pleasant (investor psychology). Common split ratios for decreasing the share price (and increasing the shares outstanding) include 1-to-2, 1-to-3, and 2-to-3 (although any combination can be used). An inverse of these ratios is used for increasing the share price and decreasing the shares outstanding.

Corporate cash dividends require a historical price adjustment due to a corporate cash disbursement. By distributing cash from the company to shareholders on a per share basis, the company is decreasing its market value. To account for this decrease in value, the share price will theoretically decrease in value exactly by the dividend value per share on the ex-dividend date. For small dividends, it is nearly impossible to distinguish the cash payments from normal market movements on the ex-dividend date. However, this cash distribution is noticeable when large dividends are paid (often when special dividends are paid). Corporate stock dividends do not require a historical price adjustment because the stock disbursement only changes the ownership of already existing shares (from the company's books to the shareholder).

Adjusted Price Formula

There is no standardized formula for calculating adjusted prices, however, there are two general approaches for calculating adjusted prices.

The simplistic approach is to subtract the dividend out of the stock price going back to the IPO, and then adjusting the dividend adjustment for stock splits along the way. The main problem with this approach is that due to inflation, it is possible for the adjusted prices to go below zero, causing all sorts of issues in subsequent calculations.

The second approach involves calculating adjustments in percentage terms instead of absolute dollar value. This avoids the negative adjusted stock prices but skews profit and loss calculations. Mathematically, this approach makes more sense to me. Plus, Yahoo Finance uses it in their adjusted price calculation, so you can cross-check your work with theirs.

In this old blog post from Armand, he derived the approach and formula that produces the adjusted prices exactly as Yahoo Finance shows.

The calculation for the adjusted close follows these steps:

For the latest available trading day, the actual closing price and the adjusted closing price are the same.
For every other day ("today"), determine what percentage today's closing price is over yesterday's closing price, excluding the effect of a dividend and/or a stock split, if today is the ex-dividend date.
Calculate yesterday's adjusted close as being the same percentage down from today's adjusted close as the percentage calculated in step 2.
Repeat steps 2 and 3 for all other historic days.

I used the same formula structure, but tweaked it to work with dates in descending order (newest to oldest). The formula looks like this:

$$ A_0 = A_1 + A_1 \frac{P_0 * S - P_1 - D_1}{P_1} $$

where:

$ A_0 $ is today's adjusted price

$ A_1 $ is tomorrow's adjusted price

$ P_0 $ is today's actual price

$ P_1 $ is tomorrow's actual price

$ S_1 $ is the split ratio for tomorrow (the ex-date)

$ D_1 $ is the actual dividend for tomorrow (the ex-date)

If there is no stock split, the ratio defaults to 1. For a 1-to-7 split, the split ratio would be 0.1429 $(\frac{1}{7} = 0.1429)$. If there is no dividend, the ratio defaults to 0.

The current day's adjusted prices are always the same as the raw prices. All dividends and splits are known ahead of time. Therefore, it is not an issue to have the formula based on future events.

Programming the Calculation

There are two ways to program this formula in Python. An iterative function and a vectorized function. Both functions use Pandas to organize and hold the data.

The function input variables include a Pandas DataFrame, and a string of the column that should have adjusted prices calculated for it (i.e. 'close'). Both functions return a DataFrame similar to the input DataFrame, but with an additional column for the new adjusted prices. All of my code is written in Python 3.5.

Preparing the Input DataFrame

The input DataFrame structure is very important. The functions expect that the DataFrame has four columns, with the price (i.e. 'close'), dividend and split_ratio columns being essential for the adjustment calculation. Also, the DataFrame index should be dates sorted oldest to latest.

I pulled these DataFrame values from the pySecMaster. Here is a snapshot of the input DataFrame for AAPL between 1/23/2015 and 2/6/2015, which shows both a stock split (1/26/2015) and a cash dividend (2/5/2015).

                tsid    close  dividend  split_ratio
date                                                
2015-01-23  AAPL.Q.0  112.980      0.00          1.0
2015-01-26  AAPL.Q.0   56.550      0.00          0.5
2015-01-27  AAPL.Q.0   54.570      0.00          1.0
2015-01-28  AAPL.Q.0   57.655      0.00          1.0
2015-01-29  AAPL.Q.0   59.450      0.00          1.0
2015-01-30  AAPL.Q.0   58.580      0.00          1.0
2015-02-02  AAPL.Q.0   59.315      0.00          1.0
2015-02-03  AAPL.Q.0   59.325      0.00          1.0
2015-02-04  AAPL.Q.0   59.780      0.00          1.0
2015-02-05  AAPL.Q.0   59.970      0.57          1.0
2015-02-06  AAPL.Q.0   59.465      0.00          1.0

Iterative Function

The iterative function first adds a new column for the adjusted price, and then fills it in with the latest raw price. A list of dates are extracted from the DataFrame, minus the latest date (all in reverse order). This list of dates is used in a for loop to iterate over the DataFrame (via .loc). By using the index dates, it is possible to select future or prior values that are on different DataFrame rows.

When all relevant formula components are identified, they are plugged into the adjusted price formula that I showed above. The adjusted price is then rounded to the fourth decimal place and added to the adjusted price column of the DataFrame.

Here is the Python code for this function:

def calculate_adjusted_prices_iterative(df, column):
    """ Iteratively calculates the adjusted prices for the specified column in
    the provided DataFrame. This creates a new column called 'adj_<column name>'
    with the adjusted prices. This function requires that the DataFrame have
    columns with dividend and split_ratio values.

    :param df: DataFrame with raw prices along with dividend and split_ratio
        values
    :param column: String of which price column should have adjusted prices
        created for it
    :return: DataFrame with the addition of the adjusted price column
    """
    adj_column = 'adj_' + column

    # Set default values for adjusted price column to 0
    df[adj_column] = 0

    # Get a list of the index dates, in reverse order (most recent to oldest)
    dates = list(df.index)[:-1]
    dates.reverse()
    final_date = list(df.index)[-1]

    # Put the column's last price as the last adj_<column>'s value
    df.loc[final_date, adj_column] = df.loc[final_date, column]

    # Iterate through each DataFrame row from bottom to top (newest to oldest)
    for date in dates:
        if dates.index(date) - 1 == -1:
            # If this is the first date in the index, use the final date
            #   variable, which is the newest date
            preceding_date = final_date
        else:
            # Otherwise, the next date variable will be the date preceding the
            #   current one (the next calendar day)
            preceding_date = dates[dates.index(date) - 1]

        current_val = df.loc[date, column]
        precede_val = df.loc[preceding_date, column]
        precede_adj = df.loc[preceding_date, adj_column]

        # Both the split ratio and dividend need to use the preceding date's
        #   values (the next calendar day's values)
        split_ratio = df.loc[preceding_date, 'split_ratio']
        dividend = df.loc[preceding_date, 'dividend']

        adj_price = (precede_adj + precede_adj *
                     (((current_val * split_ratio) - precede_val - dividend) /
                      precede_val))

        # Add the adjusted price to the adj_<column> in the DataFrame
        df.loc[date, adj_column] = round(adj_price, 4)

    return df

The problem with the iterative function is that it is very resource intensive because it has to loop through each row of the DataFrame. For stocks with decades of historical prices, this is very inefficient.

Testing AAPL's entire price history (12/12/1980 - 8/26/2016; 8948 periods), the iterative approach took a massive 12.616 seconds to complete. This is way too long for an essential calculation, especially when you need to prepare the adjusted price history for dozens or hundreds of symbols (which would take dozens of minutes, if not more). I thought of three ways to make this faster.

One way is to pre-calculate all the adjusted prices ahead of time and store the values in the database ready for use. The problems with this are that you have to store all these adjusted prices (doubling your database size), and you have to recalculate the entire adjusted price history any time the company has a new stock split or dividend. Admittedly, these corporate events happen about once a quarter, so it's not too bad, but we can do better.

The second band-aid to speed up the calculation is to multiprocess the function based on symbols. Each symbol being calculated past the first would be multiprocessed up to the CPU core count (or hyper-thread count). This is a respectable approach, but it bypasses the core of the problem; the function itself is inefficient.

The best fix for the iterative function is to scrap it and vectorize it (utilizing the efficiencies of NumPy). By vectorizing the adjusted price calculation, the function is able to perform all price calculations at the same time. Fancy!

Vectorized Function

Put simply, vectorization is the process of rewriting a loop so that instead of processing a single element of an array N times, multiple elements of the array are processed simultaneously.

Vectorizing the adjusted price calculation is possible because all of the values are already in an array. However, the vectorization here is not a textbook case because the adjusted price calculation relies on data from adjacent rows. Normally, vectorization uses only values from columns of the same row.

Through considerable trial and error, I eventually found a code structure that allowed for the vectorization that made this possible. It uses NumPy to facilitate the vectorization.

The vectorized function first separates the price, split and dividend DataFrame columns into individual NumPy representations of NDFrames (via .values). A NumPy array filled with zeros is also created for the calculated adjusted prices. The latest raw price is placed into the first array location.

For the actual price calculation, a for loop is used to iterate over the length of the array, minus the first item (the latest date). Since each NumPy array is indexed based on integer values, it is easy to reference prior array values by simply decreasing the index value. The calculation output is rounded to the fourth decimal place before being put into the adjusted price array.

Once all adjusted price values are added to the array, the NumPy array is placed into the original DataFrame.

Here is the Python code for this function:

def calculate_adjusted_prices(df, column):
    """ Vectorized approach for calculating the adjusted prices for the
    specified column in the provided DataFrame. This creates a new column
    called 'adj_<column name>' with the adjusted prices. This function requires
    that the DataFrame have columns with dividend and split_ratio values.

    :param df: DataFrame with raw prices along with dividend and split_ratio
        values
    :param column: String of which price column should have adjusted prices
        created for it
    :return: DataFrame with the addition of the adjusted price column
    """
    adj_column = 'adj_' + column

    # Reverse the DataFrame order, sorting by date in descending order
    df.sort_index(ascending=False, inplace=True)

    price_col = df[column].values
    split_col = df['split_ratio'].values
    dividend_col = df['dividend'].values
    adj_price_col = np.zeros(len(df.index))
    adj_price_col[0] = price_col[0]

    for i in range(1, len(price_col)):
        adj_price_col[i] = 
            round((adj_price_col[i - 1] + adj_price_col[i - 1] *
                   (((price_col[i] * split_col[i - 1]) -
                     price_col[i - 1] -
                     dividend_col[i - 1]) / price_col[i - 1])), 4)

    df[adj_column] = adj_price_col

    # Change the DataFrame order back to dates ascending
    df.sort_index(ascending=True, inplace=True)

    return df

The vectorized function has amazing performance. Compared to the iterative approach taking 12.616 seconds to complete, this function completed calculating all adjusted close prices for AAPL (12/12/1980 - 8/26/2016; 8948 periods) in 0.039 seconds(!). This is the performance I wanted. To calculate the adjusted close prices for a few thousand symbols, it will only take a few minutes instead of 30 minutes plus.

I did try to get better performance using Cython to vectorize the calculations, but I always got worse performance. Technically, Cython should be faster than NumPy due to the fact that it runs directly in C. I assume that my approach is incorrect for fully utilizing Cython.

Also, the NumPy vectorized function can be multiprocessed based on symbols for more performance. I have yet to implement this, but it would only require a multiprocess wrapper function to handle it.

Final output DataFrame

This is what the final DataFrame looks like with the adjusted close column at the end.

                tsid    close  dividend  split_ratio  adj_close
date                                                           
2015-01-23  AAPL.Q.0  112.980      0.00          1.0    55.9513
2015-01-26  AAPL.Q.0   56.550      0.00          0.5    56.0107
2015-01-27  AAPL.Q.0   54.570      0.00          1.0    54.0496
2015-01-28  AAPL.Q.0   57.655      0.00          1.0    57.1052
2015-01-29  AAPL.Q.0   59.450      0.00          1.0    58.8831
2015-01-30  AAPL.Q.0   58.580      0.00          1.0    58.0214
2015-02-02  AAPL.Q.0   59.315      0.00          1.0    58.7494
2015-02-03  AAPL.Q.0   59.325      0.00          1.0    58.7593
2015-02-04  AAPL.Q.0   59.780      0.00          1.0    59.2100
2015-02-05  AAPL.Q.0   59.970      0.57          1.0    59.9700
2015-02-06  AAPL.Q.0   59.465      0.00          1.0    59.4650

Conclusion

I hope that you were able to glean some insight into calculating adjusted prices for any historical stock price set. I know there are a few other resources on the internet for this, but I never found a resource that covered the performance side of getting a fast calculation.

If you find a better way to calculate adjusted stock prices, please do share. I'd love to have a faster function for handling this.