Introduction
In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.
Fetching Data
Here we use a QuantBook instance to retrieve historical data:
from datetime import datetime
import numpy as np    # used later for np.mean and np.log
import pandas as pd   # used later for pd.date_range and pd.concat
qb = QuantBook()
We will create a Series named "aapl" whose values are Apple's daily closing prices, indexed by date:
symbol = qb.AddEquity("AAPL").Symbol
aapl_table = qb.History(symbol, datetime(1998, 1, 1), qb.Time, Resolution.Daily).loc[symbol]
aapl = aapl_table['close']['2017']
print(aapl)
Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].
print(aapl['2017-3'])
time
2017-03-01 32.189492
2017-03-02 32.847428
2017-03-03 32.652397
2017-03-04 32.845078
2017-03-07 32.741688
2017-03-08 32.783984
2017-03-09 32.661796
2017-03-10 32.586603
2017-03-11 32.694693
2017-03-14 32.708791
2017-03-15 32.659446
2017-03-16 33.004862
2017-03-17 33.058907
2017-03-18 32.894423
2017-03-21 33.239839
2017-03-22 32.859177
2017-03-23 33.230440
2017-03-24 33.112952
2017-03-25 33.047158
2017-03-28 33.103553
2017-03-29 33.789685
2017-03-30 33.864878
2017-03-31 33.820232
Name: close, dtype: float64
Or in several consecutive months:
aapl['2017-2':'2017-4']
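As noted above, a single data point can be fetched with a full date string. A quick sketch, using a trading day from the March output above:
print(aapl['2017-03-08'])   # a single closing price (a float)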
.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.
print(aapl.head())
print(aapl.tail(10))
The output:
time
2017-01-04 27.174753
2017-01-05 27.144338
2017-01-06 27.282376
2017-01-07 27.586527
2017-01-10 27.839207
Name: close, dtype: float64
time
2017-12-16 41.357005
2017-12-19 41.939431
2017-12-20 41.492508
2017-12-21 41.447341
2017-12-22 41.604239
2017-12-23 41.604239
2017-12-27 40.548740
2017-12-28 40.555872
2017-12-29 40.669980
2017-12-30 40.230189
Name: close, dtype: float64
Resampling
series.resample(freq)
returns a DatetimeIndexResampler object, which groups the data in a Series into regular time intervals. The argument "freq" determines the length of each interval.
series.resample(freq).mean()
is a complete statement that groups the data into intervals and then computes the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:
by_month = aapl.resample('M').mean()
print(by_month)
time
2017-01-31 27.952986
2017-02-28 31.178201
2017-03-31 32.973805
2017-04-30 33.584198
2017-05-31 35.856393
2017-06-30 34.974080
2017-07-31 34.935234
2017-08-31 37.460385
2017-09-30 37.405355
2017-10-31 37.256125
2017-11-30 40.897574
2017-12-31 40.862424
Freq: M, Name: close, dtype: float64
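As mentioned, resample() by itself returns a resampler object rather than data. A quick sketch to confirm:
print(type(aapl.resample('M')))   # <class 'pandas.core.resample.DatetimeIndexResampler'>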
We can also aggregate the data by week:
by_week = aapl.resample('W').mean()
print(by_week.head())
time
2017-01-08 27.296999
2017-01-15 27.894890
2017-01-22 28.062056
2017-01-29 28.347841
2017-02-05 29.448401
Freq: W-SUN, Name: close, dtype: float64
We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.
three_day = aapl.resample('3D').mean()
two_week = aapl.resample('2W').mean()
two_month = aapl.resample('2M').mean()
Besides the mean() method, other methods can also be used with the resampler:
# named weekly_* to avoid shadowing Python's built-in max and min
weekly_std = aapl.resample('W').std()   # standard deviation of each week
weekly_max = aapl.resample('W').max()   # maximum value of each week
weekly_min = aapl.resample('W').min()   # minimum value of each week
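The resampler can also compute several of these statistics in one pass with agg(). A minimal sketch:
weekly_stats = aapl.resample('W').agg(['mean', 'std', 'min', 'max'])
print(weekly_stats.head())   # one column per statistic, one row per week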
Often we want to calculate the monthly returns of a stock, based on the price on the last day of each month. To fetch those prices, we use the series.resample(freq).agg() method:
last_day = aapl.resample('M').agg(lambda x: x.iloc[-1])   # last closing price of each month
print(last_day)
time
2017-01-31 28.456868
2017-02-28 32.175394
2017-03-31 33.820232
2017-04-30 33.754439
2017-05-31 36.257952
2017-06-30 33.900843
2017-07-31 35.274054
2017-08-31 38.693270
2017-09-30 36.506928
2017-10-31 39.491533
2017-11-30 40.289620
2017-12-31 40.230189
Freq: M, Name: close, dtype: float64
Or directly calculate the monthly rates of return using the prices on the first and last day of each month:
monthly_return = aapl.resample('M').agg(lambda x: x.iloc[-1] / x.iloc[0] - 1)
print(monthly_return)
time
2017-01-31 0.048354
2017-02-28 0.068145
2017-03-31 0.029616
2017-04-30 -0.000348
2017-05-31 0.046060
2017-06-30 -0.062019
2017-07-31 0.041812
2017-08-31 0.092912
2017-09-30 -0.060530
2017-10-31 0.079234
2017-11-30 0.019170
2017-12-31 -0.010640
Freq: M, Name: close, dtype: float64
The Series object also provides convenient methods for quick calculations:
print(monthly_return.mean())
print(monthly_return.std())
print(monthly_return.max())
[out]: 0.024313746219922504
0.050033113016599684
0.09291213766121786
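Alternatively, describe() computes a whole set of summary statistics in one call. A quick sketch:
print(monthly_return.describe())   # count, mean, std, min, quartiles and max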
Two other methods frequently used on a Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change between them.
print(last_day.diff())
print(last_day.pct_change())
time
2017-01-31 NaN
2017-02-28 3.718526
2017-03-31 1.644839
2017-04-30 -0.065794
2017-05-31 2.503514
2017-06-30 -2.357109
2017-07-31 1.373211
2017-08-31 3.419216
2017-09-30 -2.186341
2017-10-31 2.984605
2017-11-30 0.798087
2017-12-31 -0.059431
Freq: M, Name: close, dtype: float64
time
2017-01-31 NaN
2017-02-28 0.130672
2017-03-31 0.051121
2017-04-30 -0.001945
2017-05-31 0.074168
2017-06-30 -0.065009
2017-07-31 0.040507
2017-08-31 0.096933
2017-09-30 -0.056504
2017-10-31 0.081754
2017-11-30 0.020209
2017-12-31 -0.001475
Freq: M, Name: close, dtype: float64
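The two methods are directly related: the percentage change is the difference divided by the previous value. A minimal sketch verifying this:
# pct_change() equals diff() divided by the previous value, i.e. shift(1)
check = last_day.diff() / last_day.shift(1)
print(np.allclose(check.dropna(), last_day.pct_change().dropna()))   # True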
Notice that calculating percentage changes, i.e. returns, introduced a NaN value.
When dealing with NaN values, we usually either remove the data point or fill it with a specific value. Here we fill it with 0:
monthly_return = last_day.pct_change()
print(monthly_return.fillna(0))
time
2017-01-31 0.000000
2017-02-28 0.130672
2017-03-31 0.051121
2017-04-30 -0.001945
2017-05-31 0.074168
2017-06-30 -0.065009
2017-07-31 0.040507
2017-08-31 0.096933
2017-09-30 -0.056504
2017-10-31 0.081754
2017-11-30 0.020209
2017-12-31 -0.001475
Freq: M, Name: close, dtype: float64
Alternatively, we can fill a NaN with the next valid value. This is called 'backward fill', or 'bfill' for short:
monthly_return = last_day.pct_change()
print(monthly_return.bfill())   # equivalent to fillna(method='bfill') in older pandas versions
time
2017-01-31 0.130672
2017-02-28 0.130672
2017-03-31 0.051121
2017-04-30 -0.001945
2017-05-31 0.074168
2017-06-30 -0.065009
2017-07-31 0.040507
2017-08-31 0.096933
2017-09-30 -0.056504
2017-10-31 0.081754
2017-11-30 0.020209
2017-12-31 -0.001475
Freq: M, Name: close, dtype: float64
As expected, alongside 'backward fill' there is also a 'forward fill' method, or 'ffill' for short, which propagates the last valid value forward. However, we can't use it here because the NaN is the first value.
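To see forward fill in action, here is a minimal sketch with a NaN in the middle of a Series:
s = pd.Series([1.0, np.nan, 3.0])
print(s.ffill())   # the NaN becomes 1.0, carried forward from the previous value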
We can also simply remove NaN values with .dropna():
monthly_return = last_day.pct_change().dropna()
print(monthly_return)
time
2017-02-28 0.130672
2017-03-31 0.051121
2017-04-30 -0.001945
2017-05-31 0.074168
2017-06-30 -0.065009
2017-07-31 0.040507
2017-08-31 0.096933
2017-09-30 -0.056504
2017-10-31 0.081754
2017-11-30 0.020209
2017-12-31 -0.001475
Freq: M, Name: close, dtype: float64
DataFrame
The DataFrame is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.
More precisely, a DataFrame is a collection of Series objects sharing a common index, where each column may have a different data type. A DataFrame can be created from various inputs: a dictionary, a 2-D numpy.ndarray, a Series, or another DataFrame.
Create DataFrames
The most common method of creating a DataFrame is passing a dictionary:
# named 'prices' to avoid shadowing Python's built-in dict
prices = {'AAPL': [143.5, 144.09, 142.73, 144.18, 143.77],
          'GOOG': [898.7, 911.71, 906.69, 918.59, 926.99],
          'IBM': [155.58, 153.67, 152.36, 152.94, 153.49]}
dates = pd.date_range('2017-07-03', periods = 5, freq = 'D')
df = pd.DataFrame(prices, index = dates)
print(df)
print(df)
AAPL GOOG IBM
2017-07-03 143.50 898.70 155.58
2017-07-04 144.09 911.71 153.67
2017-07-05 142.73 906.69 152.36
2017-07-06 144.18 918.59 152.94
2017-07-07 143.77 926.99 153.49
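As mentioned above, a DataFrame can also be built from a 2-D numpy.ndarray, in which case we supply the column names ourselves. A minimal sketch with made-up values:
values = np.array([[143.50, 898.70], [144.09, 911.71]])
df2 = pd.DataFrame(values, columns = ['AAPL', 'GOOG'],
                   index = pd.date_range('2017-07-03', periods = 2, freq = 'D'))
print(df2)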
Manipulating DataFrames
We can fetch values in a DataFrame by column and index. Each column in a DataFrame is essentially a Pandas Series, and we can fetch a column with square brackets: df['column_name']
If a column name contains no spaces (and doesn't clash with an existing DataFrame attribute), we can also use df.column_name to fetch a column:
df = aapl_table
print(df.close.tail(5))
print(df['volume'].tail(5))
time
2022-11-23 150.18
2022-11-24 151.07
2022-11-26 148.11
2022-11-29 144.22
2022-11-30 141.17
Name: close, dtype: float64
time
2022-11-23 49725746.0
2022-11-24 54910091.0
2022-11-26 32690261.0
2022-11-29 65935833.0
2022-11-30 81196875.0
Name: volume, dtype: float64
All the methods we applied to a Series, such as iloc[], loc[], and the resampling methods, can also be applied to a DataFrame:
aapl_2016 = df.loc['2016']
aapl_month = aapl_2016.resample('M').agg(lambda x: x.iloc[-1])
print(aapl_month)
close high low open volume
time
2016-01-31 22.278245 22.278245 21.593922 21.632830 260957577.0
2016-02-29 22.300185 22.555610 22.224248 22.366918 118863851.0
2016-03-31 25.211106 25.409003 24.990198 24.999403 189291395.0
2016-04-30 21.570729 24.605911 21.287691 21.630559 276220537.0
2016-05-31 23.232365 23.260146 22.977700 23.021688 152055352.0
2016-06-30 21.854860 21.889587 21.676595 21.755309 150690926.0
2016-07-31 24.126006 24.204721 24.003304 24.121376 114985879.0
2016-08-31 24.673355 24.789739 24.556971 24.631457 100754470.0
2016-09-30 26.111858 26.488941 26.023407 26.349281 150588858.0
2016-10-31 26.470320 26.817144 26.407473 26.500580 149627964.0
2016-11-30 26.077469 26.210827 25.752261 25.897317 118270356.0
2016-12-31 27.097545 27.420414 27.006300 27.291734 121865341.0
We may select certain columns of a DataFrame by passing a list of their names:
aapl_bar = aapl_month[['open', 'high', 'low', 'close']]
print(aapl_bar)
open high low close
time
2016-01-31 21.632830 22.278245 21.593922 22.278245
2016-02-29 22.366918 22.555610 22.224248 22.300185
2016-03-31 24.999403 25.409003 24.990198 25.211106
2016-04-30 21.630559 24.605911 21.287691 21.570729
2016-05-31 23.021688 23.260146 22.977700 23.232365
2016-06-30 21.755309 21.889587 21.676595 21.854860
2016-07-31 24.121376 24.204721 24.003304 24.126006
2016-08-31 24.631457 24.789739 24.556971 24.673355
2016-09-30 26.349281 26.488941 26.023407 26.111858
2016-10-31 26.500580 26.817144 26.407473 26.470320
2016-11-30 25.897317 26.210827 25.752261 26.077469
2016-12-31 27.291734 27.420414 27.006300 27.097545
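Note the difference between single and double square brackets: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one element:
print(type(aapl_month['close']))     # <class 'pandas.core.series.Series'>
print(type(aapl_month[['close']]))   # <class 'pandas.core.frame.DataFrame'>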
We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:
print(aapl_month.loc['2016-03':'2016-06', ['open', 'high', 'low', 'close']])
open high low close
time
2016-03-31 24.999403 25.409003 24.990198 25.211106
2016-04-30 21.630559 24.605911 21.287691 21.570729
2016-05-31 23.021688 23.260146 22.977700 23.232365
2016-06-30 21.755309 21.889587 21.676595 21.854860
The subsetting methods of a DataFrame are quite useful. By writing a logical statement in square brackets, we can create a customized subset:
above = aapl_bar[aapl_bar.close > np.mean(aapl_bar.close)]
print(above)
open high low close
time
2016-03-31 24.999403 25.409003 24.990198 25.211106
2016-08-31 24.631457 24.789739 24.556971 24.673355
2016-09-30 26.349281 26.488941 26.023407 26.111858
2016-10-31 26.500580 26.817144 26.407473 26.470320
2016-11-30 25.897317 26.210827 25.752261 26.077469
2016-12-31 27.291734 27.420414 27.006300 27.097545
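Conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A sketch selecting the months that closed above the mean and also above their open:
above_and_up = aapl_bar[(aapl_bar.close > aapl_bar.close.mean()) & (aapl_bar.close > aapl_bar.open)]
print(above_and_up)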
Data Validation
As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:
aapl_bar['rate_return'] = aapl_bar.close.pct_change()
print(aapl_bar)
open high low close rate_return
time
2016-01-31 21.632830 22.278245 21.593922 22.278245 NaN
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
Here the calculation introduced a NaN value. If the DataFrame were large, we might not notice it. isnull() provides a convenient way to check for missing values:
missing = aapl_bar.isnull()
print(missing)
print('---------------------------------------------')
print(missing.describe())
open high low close rate_return
time
2016-01-31 False False False False True
2016-02-29 False False False False False
2016-03-31 False False False False False
2016-04-30 False False False False False
2016-05-31 False False False False False
2016-06-30 False False False False False
2016-07-31 False False False False False
2016-08-31 False False False False False
2016-09-30 False False False False False
2016-10-31 False False False False False
2016-11-30 False False False False False
2016-12-31 False False False False False
---------------------------------------------
open high low close rate_return
count 12 12 12 12 12
unique 1 1 1 1 2
top False False False False False
freq 12 12 12 12 11
The row labelled "unique" indicates the number of distinct values in each column. Since the "rate_return" column has 2 unique values (True and False), it contains at least one missing value.
We can deduce the number of missing values by comparing "count" with "freq": there are 12 values in total and 11 of them are False, so there is one True value, which corresponds to the missing value.
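A more direct way to count the missing values in each column is to chain isnull() with sum():
print(aapl_bar.isnull().sum())   # number of NaN values per column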
We can also find the rows with missing values easily:
print(missing[missing.rate_return])
open high low close rate_return
time
2016-01-31 False False False False True
Usually when dealing with missing data, we either delete the whole row or fill it with some value. The dropna() and fillna() methods we used on a Series can also be applied to a DataFrame:
drop = aapl_bar.dropna()
print(drop)
print('\n--------------------------------------------------\n')
fill = aapl_bar.fillna(0)
print(fill)
open high low close rate_return
time
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
--------------------------------------------------
open high low close rate_return
time
2016-01-31 21.632830 22.278245 21.593922 22.278245 0.000000
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
DataFrame Concat
We have seen how to extract a Series from a DataFrame. Now let's consider how to merge a Series or a DataFrame into another one.
In Pandas, the function concat() allows us to merge multiple Series into a DataFrame:
s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1, s2], axis = 1)
print(data_frame)
AAPL GOOG
0 143.50 898.70
1 144.09 911.71
2 142.73 906.69
3 144.18 918.59
4 143.77 926.99
The "axis = 1" parameter will join two DataFrames by columns:
log_price = np.log(aapl_bar.close)
log_price.name = 'log_price'
print(log_price)
print('\n--------------------------------------------\n')
concat = pd.concat([aapl_bar, log_price], axis = 1)
print(concat)
time
2016-01-31 3.103611
2016-02-29 3.104595
2016-03-31 3.227285
2016-04-30 3.071337
2016-05-31 3.145546
2016-06-30 3.084423
2016-07-31 3.183290
2016-08-31 3.205724
2016-09-30 3.262390
2016-10-31 3.276024
2016-11-30 3.261072
2016-12-31 3.299443
Freq: M, Name: log_price, dtype: float64
--------------------------------------------
open high low close rate_return log_price
time
2016-01-31 21.632830 22.278245 21.593922 22.278245 NaN 3.103611
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985 3.104595
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533 3.227285
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396 3.071337
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032 3.145546
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292 3.084423
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919 3.183290
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687 3.205724
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302 3.262390
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728 3.276024
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841 3.261072
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117 3.299443
concat() also aligns on the index when the indices only partially overlap. Consider these two DataFrames:
df_volume = aapl_table.loc['2016-10':'2017-04', ['volume']].resample('M').agg(lambda x: x.iloc[-1])
print(df_volume)
print('\n-------------------------------------------\n')
df_2017 = aapl_table.loc['2016-10':'2017-04', ['open', 'high', 'low', 'close']].resample('M').agg(lambda x: x.iloc[-1])
print(df_2017)
volume
time
2016-10-31 149627964.0
2016-11-30 118270356.0
2016-12-31 121865341.0
2017-01-31 126366565.0
2017-02-28 80246377.0
2017-03-31 81028846.0
2017-04-30 80784430.0
-------------------------------------------
open high low close
time
2016-10-31 26.500580 26.817144 26.407473 26.470320
2016-11-30 25.897317 26.210827 25.752261 26.077469
2016-12-31 27.291734 27.420414 27.006300 27.097545
2017-01-31 28.293094 28.456868 28.227585 28.456868
2017-02-28 32.191842 32.295232 32.022659 32.175394
2017-03-31 33.869578 33.954169 33.141149 33.820232
2017-04-30 33.853129 33.907174 33.662798 33.754439
Now we merge df_volume with our DataFrame aapl_bar:
concat = pd.concat([aapl_bar, df_volume], axis = 1)
print(concat)
open high low close rate_return \
time
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
2017-01-31 NaN NaN NaN NaN NaN
2017-02-28 NaN NaN NaN NaN NaN
2017-03-31 NaN NaN NaN NaN NaN
2017-04-30 NaN NaN NaN NaN NaN
volume
time
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 NaN
2016-10-31 149627964.0
2016-11-30 118270356.0
2016-12-31 121865341.0
2017-01-31 126366565.0
2017-02-28 80246377.0
2017-03-31 81028846.0
2017-04-30 80784430.0
By default, concat() performs an outer join: all rows and columns are kept, so no information is lost, though missing combinations are filled with NaN. We can instead merge on the intersection of the indices, which is called an 'inner join':
concat = pd.concat([aapl_bar, df_volume], axis = 1, join = 'inner')
print(concat)
open high low close rate_return \
time
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
volume
time
2016-10-31 149627964.0
2016-11-30 118270356.0
2016-12-31 121865341.0
With an inner join, only the intersection of the indices is kept. Now let's append one DataFrame to another. In older versions of pandas this was done with DataFrame.append() (removed in pandas 2.0):
append = aapl_bar.append(df_2017)   # works only on pandas < 2.0; see pd.concat below
print(append)
open high low close rate_return
time
2016-01-31 21.632830 22.278245 21.593922 22.278245 NaN
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
2016-10-31 26.500580 26.817144 26.407473 26.470320 NaN
2016-11-30 25.897317 26.210827 25.752261 26.077469 NaN
2016-12-31 27.291734 27.420414 27.006300 27.097545 NaN
2017-01-31 28.293094 28.456868 28.227585 28.456868 NaN
2017-02-28 32.191842 32.295232 32.022659 32.175394 NaN
2017-03-31 33.869578 33.954169 33.141149 33.820232 NaN
2017-04-30 33.853129 33.907174 33.662798 33.754439 NaN
Appending is essentially concatenating along axis = 0. Since DataFrame.append() was removed in pandas 2.0, pd.concat() is now the preferred way to append:
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print(concat)
open high low close rate_return
time
2016-01-31 21.632830 22.278245 21.593922 22.278245 NaN
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117
2016-10-31 26.500580 26.817144 26.407473 26.470320 NaN
2016-11-30 25.897317 26.210827 25.752261 26.077469 NaN
2016-12-31 27.291734 27.420414 27.006300 27.097545 NaN
2017-01-31 28.293094 28.456868 28.227585 28.456868 NaN
2017-02-28 32.191842 32.295232 32.022659 32.175394 NaN
2017-03-31 33.869578 33.954169 33.141149 33.820232 NaN
2017-04-30 33.853129 33.907174 33.662798 33.754439 NaN
Please note that if the two DataFrames have columns with the same names, those columns are aligned and merged together, so it's very important to have the right column names. If we rename a column here:
df_2017.columns = ['change', 'high', 'low', 'close']
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print(concat)
open high low close rate_return change
time
2016-01-31 21.632830 22.278245 21.593922 22.278245 NaN NaN
2016-02-29 22.366918 22.555610 22.224248 22.300185 0.000985 NaN
2016-03-31 24.999403 25.409003 24.990198 25.211106 0.130533 NaN
2016-04-30 21.630559 24.605911 21.287691 21.570729 -0.144396 NaN
2016-05-31 23.021688 23.260146 22.977700 23.232365 0.077032 NaN
2016-06-30 21.755309 21.889587 21.676595 21.854860 -0.059292 NaN
2016-07-31 24.121376 24.204721 24.003304 24.126006 0.103919 NaN
2016-08-31 24.631457 24.789739 24.556971 24.673355 0.022687 NaN
2016-09-30 26.349281 26.488941 26.023407 26.111858 0.058302 NaN
2016-10-31 26.500580 26.817144 26.407473 26.470320 0.013728 NaN
2016-11-30 25.897317 26.210827 25.752261 26.077469 -0.014841 NaN
2016-12-31 27.291734 27.420414 27.006300 27.097545 0.039117 NaN
2016-10-31 NaN 26.817144 26.407473 26.470320 NaN 26.500580
2016-11-30 NaN 26.210827 25.752261 26.077469 NaN 25.897317
2016-12-31 NaN 27.420414 27.006300 27.097545 NaN 27.291734
2017-01-31 NaN 28.456868 28.227585 28.456868 NaN 28.293094
2017-02-28 NaN 32.295232 32.022659 32.175394 NaN 32.191842
2017-03-31 NaN 33.954169 33.141149 33.820232 NaN 33.869578
2017-04-30 NaN 33.907174 33.662798 33.754439 NaN 33.853129
Since the 'open' column of df_2017 was renamed to 'change', the concatenated DataFrame has a new column named 'change', and the rows coming from df_2017 have NaN in the 'open' column.
Summary
In this chapter we introduced two essential Pandas tools for financial data analysis: resampling and DataFrame manipulation. We covered only the most commonly used methods; Pandas offers many more, including methods common in data mining, and the official Pandas documentation is always a good reference.