Ever since I was little, I have been fascinated by the stock market. These days, I trade options daily, but I'm still interested in strategies for stock investing. In this post, I'm going to look at how the number of stocks in a portfolio affects its performance.

First, it is important to understand that the stock market is unpredictable: not just hard to predict, but impossible. No amount of fundamental analysis, technical analysis, or crowdsourced machine learning will ever allow you to predict the markets.

Given that we can't predict the market, how should we go about investing? What variables do we have to consider? There is a positive drift to the markets, and long-term data show that stocks outperform other asset classes, albeit with greater risk. Here, we'll look at one variable, the number of stocks in your portfolio, and how it affects your risk and returns.

Data

The data for this study is QuantQuote's free daily S&P 500 dataset:
http://quantquote.com/files/quantquote_daily_sp500_83986.zip
The zip file contains a separate file for each stock in the S&P 500, with daily price data from 1998 to mid-2013. Let's look at a sample from the IBM file.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
%matplotlib inline
sns.set_style('white')

ibm = pd.read_csv('./quantquote_daily_sp500_83986/daily/table_ibm.csv',
                   names=['date', 'adjustment', 'open', 'high', 'low', 'close', 'volume'],
                   parse_dates=['date'], index_col='date')

ibm.head()
Out[1]:
adjustment open high low close volume
date
1998-01-02 0 43.4664 43.9614 43.2834 43.9365 5.980096e+06
1998-01-05 0 43.9073 44.4272 43.6744 44.1694 1.167530e+07
1998-01-06 0 43.8325 44.5853 43.7784 43.9073 8.377999e+06
1998-01-07 0 43.6744 43.6744 42.6886 43.3874 1.004552e+07
1998-01-08 0 43.0505 43.8823 42.8176 43.4165 9.394955e+06

Each row contains a date; an adjustment value; open, high, low, and close prices; and volume. The adjustment column is supposed to capture split/dividend adjustments, but I did not find anything other than zero in any of the files.

For this post, I will be looking at returns over different periods of time, which we can calculate using pandas' shift function.

In [2]:
# percent change from the previous close; the first row has no prior close, so fill with 0
ibm['daily_return'] = 100 * (ibm.close / ibm.close.shift(1) - 1)
ibm.fillna(0, inplace=True)
ibm.head()
Out[2]:
adjustment open high low close volume daily_return
date
1998-01-02 0 43.4664 43.9614 43.2834 43.9365 5.980096e+06 0.000000
1998-01-05 0 43.9073 44.4272 43.6744 44.1694 1.167530e+07 0.530083
1998-01-06 0 43.8325 44.5853 43.7784 43.9073 8.377999e+06 -0.593397
1998-01-07 0 43.6744 43.6744 42.6886 43.3874 1.004552e+07 -1.184086
1998-01-08 0 43.0505 43.8823 42.8176 43.4165 9.394955e+06 0.067070

Let's look at the distribution of the daily returns.

In [3]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.distplot(ibm['daily_return'], ax=ax[0]);
sns.boxplot(ibm['daily_return'], ax=ax[1]);

Just eyeballing these graphs, the distribution looks rather symmetric, with several extreme positive and negative outliers.
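
We can also put numbers on that eyeball impression with scipy (a quick sketch, assuming scipy is installed; I won't quote the exact values here):

from scipy.stats import skew, kurtosis

# skewness near 0 means roughly symmetric; positive excess kurtosis means fat tails
print('skew:', round(skew(ibm['daily_return']), 3))
print('excess kurtosis:', round(kurtosis(ibm['daily_return']), 3))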

We can also look at monthly returns using the resample function.

In [4]:
# .first() keeps the first trading day of each month (rows are labeled with month-end dates)
ibm_months = ibm.resample('M').first()
ibm_months['monthly_return'] = 100 * (ibm_months.close / ibm_months.close.shift(1) - 1)
ibm_months.fillna(0, inplace=True)
ibm_months.head()
Out[4]:
adjustment open high low close volume daily_return monthly_return
date
1998-01-31 0 43.4664 43.9614 43.2834 43.9365 5980095.513 0.000000 0.000000
1998-02-28 0 41.5947 42.1146 41.5947 41.7777 8060649.985 1.577727 -4.913455
1998-03-31 0 43.4247 43.5789 41.6783 42.4619 8224167.494 -2.329855 1.637716
1998-04-30 0 43.2955 43.7123 43.0580 43.5039 6083487.445 0.423356 2.453965
1998-05-31 0 48.3219 48.9220 48.1386 48.7887 5799886.054 1.184003 12.147876
In [5]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.distplot(ibm_months['monthly_return'], bins=100, ax=ax[0]);
sns.boxplot(ibm_months['monthly_return'], ax=ax[1]);

Interestingly, on a monthly timescale, the returns appear to be somewhat positively skewed.

Stock Market Skew

The main point of this post is to show that stock market returns are skewed, and to see how that will inform our investing strategy. First, we will see that there is a definite positive skew, which becomes more apparent in longer timeframes. Let's pull out some statistics over all 500 stocks for daily, weekly, monthly, annual, and two-year returns. In addition, we'll look at all-time average returns (I'm using average annual returns because not all stocks were in the index for the full time period).
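
Since I'll lean on it below, here's the annualization formula as a toy calculation (hypothetical numbers, purely for illustration):

# hypothetical example: a stock that triples over 12.5 years
growth = 3.0                                    # last close / first close
years = 12.5
avg_annual = 100 * (growth ** (1 / years) - 1)  # about 9.2% per year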

In [6]:
import glob

daily_returns = []
weekly_returns = []
monthly_returns = []
annual_returns = []
twoyear_returns = []
alltime_avg_returns = []

files = glob.glob('./quantquote_daily_sp500_83986/daily/*.csv')
for file in files:
    df = pd.read_csv(file, names=['date', 'adjustment', 'open', 'high', 'low', 'close', 'volume'],
                  parse_dates=['date'], index_col='date')
    
    # just to check whether there is anything interesting in the adjustment column
    if df.adjustment.sum() != 0:
        print(file)
    
    df['return'] = 100 * (df.close / df.close.shift(1) - 1)
    df.fillna(0, inplace=True)
    daily_returns.append(df['return'])
    
    weekly = df.resample('W').first()
    weekly['return'] = 100 * (weekly.close / weekly.close.shift(1) - 1)
    weekly.fillna(0, inplace=True)
    weekly_returns.append(weekly['return'])
    
    monthly = df.resample('M').first()
    monthly['return'] = 100 * (monthly.close / monthly.close.shift(1) - 1)
    monthly.fillna(0, inplace=True)
    monthly_returns.append(monthly['return'])
    
    annual = df.resample('A').first()
    annual['return'] = 100 * (annual.close / annual.close.shift(1) - 1)
    annual.fillna(0, inplace=True)
    annual_returns.append(annual['return'])
    
    twoyear = df.resample('2A').first()
    twoyear['return'] = 100 * (twoyear.close / twoyear.close.shift(1) - 1)
    twoyear.fillna(0, inplace=True)
    twoyear_returns.append(twoyear['return'])
    
    # annualized all-time return: (last close / first close) ^ (1 / years) - 1
    alltime_growth = df.close.iloc[-1] / df.close.iloc[0]
    years = (df.index.max() - df.index.min()) / timedelta(days=365)
    avg_return = 100 * (alltime_growth ** (1 / years) - 1)
    alltime_avg_returns.append(avg_return)

all_dailies = pd.concat(daily_returns, axis=0)
all_weeklies = pd.concat(weekly_returns, axis=0)
all_monthlies = pd.concat(monthly_returns, axis=0)
all_annuals = pd.concat(annual_returns, axis=0)
all_twoyears = pd.concat(twoyear_returns, axis=0)

fig, ax = plt.subplots(2, 3, figsize=(14, 10))
# KDE versions (commented out; the boxplots below show the skew more clearly)
#sns.distplot(all_dailies, bins=200, ax=ax[0, 0]).set_title('Daily');
#sns.distplot(all_weeklies, bins=200, ax=ax[0, 1]).set_title('Weekly');
#sns.distplot(all_monthlies, bins=200, ax=ax[0, 2]).set_title('Monthly');
#sns.distplot(all_annuals, bins=200, ax=ax[1, 0]).set_title('Annual');
#sns.distplot(all_twoyears, bins=200, ax=ax[1, 1]).set_title('Two Year');
fig.delaxes(ax[1, 2])
sns.boxplot(all_dailies, ax=ax[0, 0]).set_title('Daily');
sns.boxplot(all_weeklies, ax=ax[0, 1]).set_title('Weekly');
sns.boxplot(all_monthlies, ax=ax[0, 2]).set_title('Monthly');
sns.boxplot(all_annuals, ax=ax[1, 0]).set_title('Annual');
sns.boxplot(all_twoyears, ax=ax[1, 1]).set_title('Two Year');

In this case, I think it is a bit easier to see the skew in the boxplots than in the KDEs. Although there are a few extreme outliers even in the daily timeframe, the skew becomes much more apparent over longer timeframes. There are clearly a few enormous returns, whereas the lowest possible value is -100%. One effect of these outsized values is to pull the mean well above the median. This is an important point because, ironically, it means that the typical (median) stock underperforms the average of all stocks!

In [7]:
print('daily: mean =', round(np.mean(all_dailies), 2), ', median =', round(np.median(all_dailies), 2))
print('weekly: mean =', round(np.mean(all_weeklies), 2), ', median =', round(np.median(all_weeklies), 2))
print('monthly: mean =', round(np.mean(all_monthlies), 2), ', median =', round(np.median(all_monthlies), 2))
print('annual: mean =', round(np.mean(all_annuals), 2), ', median =', round(np.median(all_annuals), 2))
print('two year: mean =', round(np.mean(all_twoyears), 2), ', median =', round(np.median(all_twoyears), 2))
daily: mean = 0.07 , median = 0.02
weekly: mean = 0.34 , median = 0.25
monthly: mean = 1.41 , median = 1.25
annual: mean = 16.38 , median = 8.8
two year: mean = 29.55 , median = 12.5

Let's see what percentage of returns underperform the mean:

In [8]:
print('daily:', round(100*np.mean(all_dailies < np.mean(all_dailies)), 2))
print('weekly:', round(100*np.mean(all_weeklies < np.mean(all_weeklies)), 2))
print('monthly:', round(100*np.mean(all_monthlies < np.mean(all_monthlies)), 2))
print('annual:', round(100*np.mean(all_annuals < np.mean(all_annuals)), 2))
print('two year:', round(100*np.mean(all_twoyears < np.mean(all_twoyears)), 2))
daily: 51.31
weekly: 50.94
monthly: 50.82
annual: 59.93
two year: 63.61

As we go further out in time, more and more stocks underperform the average, because a few big winners keep pulling the mean up. This does not bode well for a stock investing strategy, especially considering that we cannot predict which stocks will be the outperformers.
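
To put a number on that skew, we could run scipy's skewness statistic over each pooled series; here's a quick sketch (assuming scipy is available, and without quoting exact values):

from scipy.stats import skew

# positive skewness = a long right tail; expect it to grow with the timeframe
for name, series in [('daily', all_dailies), ('weekly', all_weeklies),
                     ('monthly', all_monthlies), ('annual', all_annuals),
                     ('two year', all_twoyears)]:
    print(name, 'skew:', round(skew(series), 2))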

Portfolio Size and the Central Limit Theorem

What can we do about this? Well, here's where the central limit theorem comes in. The central limit theorem states that, for independent draws with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size grows, and its standard deviation shrinks like sigma / sqrt(N). This is important because these samples are our potential stock portfolios, and we want them to come as close as possible to the overall average of the stock market, which historically has performed relatively well compared to other investments, such as bonds and savings accounts.

It is interesting to see how the distribution of our stock portfolio performance changes depending on how many stocks we choose. First, let's see how the sample size affects the sample mean distribution for this simple bimodal distribution:

In [9]:
bimodal = (np.random.normal(loc=0, scale=0.2, size=10000).tolist()
           + np.random.normal(loc=10, scale=0.2, size=10000).tolist())
sns.distplot(bimodal, hist=False);

Let's try some really small sample sizes: N = 1, 2, 5.

In [10]:
fig, ax = plt.subplots(1, 3, figsize=(12, 4))

bimodal_samples_1 = np.random.choice(bimodal, size=100000)
sns.distplot(bimodal, ax=ax[0], hist=False);
sns.distplot(bimodal_samples_1, ax=ax[0], hist=False).set_title('N = 1');

bimodal_samples_2 = []
for i in range(10000):
    bimodal_samples_2.append(np.mean(np.random.choice(bimodal, size=2)))
sns.distplot(bimodal, ax=ax[1], hist=False);
sns.distplot(bimodal_samples_2, ax=ax[1], hist=False).set_title('N = 2');

bimodal_samples_5 = []
for i in range(10000):
    bimodal_samples_5.append(np.mean(np.random.choice(bimodal, size=5)))
sns.distplot(bimodal, ax=ax[2], hist=False);
sns.distplot(bimodal_samples_5, ax=ax[2], hist=False).set_title('N = 5');

The blue line shows our original distribution, while the green line shows the distribution of the sample means (10,000 of them, or 100,000 for N = 1). With these small sample sizes, the sample mean distributions look nothing like a normal distribution, but that changes with larger samples.

In [11]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

bimodal_samples_100 = []
for i in range(10000):
    bimodal_samples_100.append(np.mean(np.random.choice(bimodal, size=100)))
sns.distplot(bimodal, ax=ax[0], hist=False);
sns.distplot(bimodal_samples_100, ax=ax[0], hist=False).set_title('N = 100');

bimodal_samples_500 = []
for i in range(10000):
    bimodal_samples_500.append(np.mean(np.random.choice(bimodal, size=500)))
sns.distplot(bimodal, ax=ax[1], hist=False);
sns.distplot(bimodal_samples_500, ax=ax[1], hist=False).set_title('N = 500');

Now, we see sample mean distributions that look normal and get narrower with increasing sample size.
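
As a sanity check on the "narrower" part, the central limit theorem also predicts how fast the spread shrinks: the standard deviation of the sample mean should fall off like sigma / sqrt(N). A quick comparison against our simulated sample means (a sketch using the variables defined above):

sigma = np.std(bimodal)  # std of the original bimodal population
for n, samples in [(2, bimodal_samples_2), (5, bimodal_samples_5),
                   (100, bimodal_samples_100), (500, bimodal_samples_500)]:
    print('N =', n, '| observed:', round(np.std(samples), 3),
          '| predicted:', round(sigma / np.sqrt(n), 3))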

Let's see what happens when we sample our annual return data with various sample sizes.

In [12]:
samples_annual_1 = []
samples_annual_2 = []
samples_annual_5 = []
samples_annual_10 = []
samples_annual_50 = []
samples_annual_100 = []
samples_annual_500 = []
for i in range(10000):
    samples_annual_1.append(np.mean(np.random.choice(all_annuals, size=1)))
    samples_annual_2.append(np.mean(np.random.choice(all_annuals, size=2)))
    samples_annual_5.append(np.mean(np.random.choice(all_annuals, size=5)))
    samples_annual_10.append(np.mean(np.random.choice(all_annuals, size=10)))
    samples_annual_50.append(np.mean(np.random.choice(all_annuals, size=50)))
    samples_annual_100.append(np.mean(np.random.choice(all_annuals, size=100)))
    samples_annual_500.append(np.mean(np.random.choice(all_annuals, size=500)))

print('sample size = 1, mean = ', np.mean(samples_annual_1))
print('sample size = 2, mean = ', np.mean(samples_annual_2))
print('sample size = 5, mean = ', np.mean(samples_annual_5))
print('sample size = 10, mean = ', np.mean(samples_annual_10))
print('sample size = 50, mean = ', np.mean(samples_annual_50))
print('sample size = 100, mean = ', np.mean(samples_annual_100))
print('sample size = 500, mean = ', np.mean(samples_annual_500))
fig, ax = plt.subplots(3, 3, figsize=(12, 12))
sns.distplot(samples_annual_1, bins=200, ax=ax[0, 0]).set_title('N = 1');
sns.distplot(samples_annual_2, bins=200, ax=ax[0, 1]).set_title('N = 2');
sns.distplot(samples_annual_5, bins=200, ax=ax[0, 2]).set_title('N = 5');
sns.distplot(samples_annual_10, bins=200, ax=ax[1, 0]).set_title('N = 10');
sns.distplot(samples_annual_50, bins=200, ax=ax[1, 1]).set_title('N = 50');
sns.distplot(samples_annual_100, bins=200, ax=ax[1, 2]).set_title('N = 100');
sns.distplot(samples_annual_500, bins=200, ax=ax[2, 0]).set_title('N = 500');
fig.delaxes(ax[2, 1])
fig.delaxes(ax[2, 2])
sample size = 1, mean =  15.6493823273
sample size = 2, mean =  16.5087698508
sample size = 5, mean =  16.7731603796
sample size = 10, mean =  16.4020029299
sample size = 50, mean =  16.4201639465
sample size = 100, mean =  16.2925873127
sample size = 500, mean =  16.3696187295

In this case, I think the skew is easier to spot in the KDE plots. The differences between sample sizes, though, are easier to see when everything is plotted on the same axes.

In [13]:
fig, ax = plt.subplots(figsize=(12, 12))
ax.set_xlim(-100, 100)
sns.distplot(samples_annual_1, bins=200, ax=ax, color='red', label='N = 1');
sns.distplot(samples_annual_2, bins=200, ax=ax, color='orange', label='N = 2');
sns.distplot(samples_annual_5, bins=200, ax=ax, color='yellow', label='N = 5');
sns.distplot(samples_annual_10, bins=200, ax=ax, color='green', label='N = 10');
sns.distplot(samples_annual_50, bins=200, ax=ax, color='blue', label='N = 50');
sns.distplot(samples_annual_100, bins=200, ax=ax, color='purple', label='N = 100');
sns.distplot(samples_annual_500, bins=200, ax=ax, color='black', label='N = 500');
legend = ax.legend(fontsize='x-large', frameon=True);
legend.get_frame().set_facecolor('lightgrey')

The overlay plot makes it easy to see that the more stocks we sample, the narrower the range of likely outcomes and the less skewed the distribution. Remember that the means of all these distributions are nearly identical, but the peaks shift right, toward the mean, as the sample size increases. Both of these points are very important: we want our portfolio's performance to be high, but without great risk, where risk is the variance of returns. Although the smaller sample sizes come with the possibility of outperforming the average by a greater amount, they have a higher likelihood of underperforming it.
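
To put numbers on "narrower", here's a quick sketch that prints the middle 90% of outcomes for a few portfolio sizes:

for n, samples in [(1, samples_annual_1), (5, samples_annual_5),
                   (50, samples_annual_50), (500, samples_annual_500)]:
    lo, hi = np.percentile(samples, [5, 95])
    print('N =', n, '| 5th percentile:', round(lo, 1),
          '| 95th percentile:', round(hi, 1))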

Here are the percentages of samples that underperform the average, by sample size.

In [14]:
print('size 1:', round(100*np.mean(np.array(samples_annual_1) < np.mean(all_annuals)), 2))
print('size 2:', round(100*np.mean(np.array(samples_annual_2) < np.mean(all_annuals)), 2))
print('size 5:', round(100*np.mean(np.array(samples_annual_5) < np.mean(all_annuals)), 2))
print('size 10:', round(100*np.mean(np.array(samples_annual_10) < np.mean(all_annuals)), 2))
print('size 50:', round(100*np.mean(np.array(samples_annual_50) < np.mean(all_annuals)), 2))
print('size 100:', round(100*np.mean(np.array(samples_annual_100) < np.mean(all_annuals)), 2))
print('size 500:', round(100*np.mean(np.array(samples_annual_500) < np.mean(all_annuals)), 2))
size 1: 60.34
size 2: 59.21
size 5: 57.67
size 10: 57.73
size 50: 56.13
size 100: 56.66
size 500: 53.65

As we increase our sample size, we increase the probability of including some of the high-performing outliers, and therefore decrease our probability of underperforming the overall mean.
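
There's also a clean way to see why: np.random.choice samples with replacement, so the probability that a size-N sample includes at least one top-decile annual return is exactly 1 - 0.9^N.

# chance of drawing at least one top-10% return in N picks (with replacement)
for n in [1, 2, 5, 10, 50]:
    print('N =', n, ':', round(100 * (1 - 0.9 ** n), 1), '%')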

As a side note, what if there were no skew to the distribution of returns? Would the sample size still matter? Perhaps surprisingly, the answer is yes: greater variance of returns leads to lower compounded performance over time. Consider the following three scenarios:

1) Year 1: 0%, Year 2: 0%
$100 * 1.00 * 1.00 = $100

2) Year 1: +5%, Year 2: -5%
$100 * 1.05 * 0.95 = $99.75

3) Year 1: +10%, Year 2: -10%
$100 * 1.10 * 0.90 = $99

Although the arithmetic average of the returns in all three scenarios is 0%, the geometric average decreases as the variance increases.
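
Here is the same arithmetic in code. For a +r/-r pair of years, the per-year geometric growth factor is sqrt((1+r)(1-r)) = sqrt(1 - r^2), which is below 1 whenever r > 0; this is the variance drag:

for r in [0.00, 0.05, 0.10]:
    final = 100 * (1 + r) * (1 - r)    # $100 after a +r year and a -r year
    geo = (1 - r ** 2) ** 0.5          # per-year geometric growth factor
    print('+/-' + format(r, '.0%'), '| final: $' + format(final, '.2f'),
          '| geometric mean:', round(geo, 4))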

The samples above were taken from the total list of all annual returns for all stocks. But let's say we just want to pick a few stocks and hold them for a long period of time.

Earlier, we calculated the average annual return for each of the S&P 500 stocks. Here's a look at that distribution.

In [15]:
sns.distplot(alltime_avg_returns);

This distribution is also skewed. What percentage of the stocks underperform the mean?

In [16]:
round(100 * np.mean(alltime_avg_returns < np.mean(alltime_avg_returns)), 2)
Out[16]:
58.2

Let's sample different numbers of stocks.

In [18]:
stock_samples_1 = []
stock_samples_2 = []
stock_samples_5 = []
stock_samples_10 = []
stock_samples_50 = []
stock_samples_100 = []
stock_samples_500 = []
for i in range(10000):
    stock_samples_1.append(np.mean(np.random.choice(alltime_avg_returns, size=1)))
    stock_samples_2.append(np.mean(np.random.choice(alltime_avg_returns, size=2)))
    stock_samples_5.append(np.mean(np.random.choice(alltime_avg_returns, size=5)))
    stock_samples_10.append(np.mean(np.random.choice(alltime_avg_returns, size=10)))
    stock_samples_50.append(np.mean(np.random.choice(alltime_avg_returns, size=50)))
    stock_samples_100.append(np.mean(np.random.choice(alltime_avg_returns, size=100)))
    stock_samples_500.append(np.mean(np.random.choice(alltime_avg_returns, size=500)))

fig, ax = plt.subplots(figsize=(12, 12))
ax.set_xlim(-25, 25)
sns.distplot(stock_samples_1, bins=200, ax=ax, color='red', label='N = 1');
sns.distplot(stock_samples_2, bins=200, ax=ax, color='orange', label='N = 2');
sns.distplot(stock_samples_5, bins=200, ax=ax, color='yellow', label='N = 5');
sns.distplot(stock_samples_10, bins=200, ax=ax, color='green', label='N = 10');
sns.distplot(stock_samples_50, bins=200, ax=ax, color='blue', label='N = 50');
sns.distplot(stock_samples_100, bins=200, ax=ax, color='purple', label='N = 100');
sns.distplot(stock_samples_500, bins=200, ax=ax, color='black', label='N = 500');
legend = ax.legend(fontsize='x-large', frameon=True);
legend.get_frame().set_facecolor('lightgrey')

Again, we can easily see that picking fewer stocks leads to a greater probability of underperforming the mean.

I think this result is really interesting. It shows that even though we can't predict the returns of individual stocks or of the market as a whole, there are still decisions we can make that drastically affect our returns and our risk. Clearly, picking a large number of stocks increases our chances of including a few of those really high performers that raise our average. Fortunately, with ETFs, investing in a large number of stocks is actually easier than picking a small number.

