Introduction

This research diversifies SPY by clustering its top 200 weighted constituents using topological data analysis (TDA). The strategy aims to minimize correlation risk by employing KeplerMapper for projection, Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for clustering. The resulting clusters are then used to construct a portfolio that weights clusters and subclusters equally. This method is expected to enhance portfolio diversification, reduce the portfolio's correlation with SPY, and reduce its drawdown.

Background

Topological Data Analysis (TDA) is a field of data analysis that explores and quantifies the shape and structure of data. TDA can reveal hidden patterns and relationships in high-dimensional data, allowing us to cluster securities whose non-linear correlations are less obvious.

The Mapper Algorithm is a TDA tool that projects high-dimensional data into a lower-dimensional space while preserving a combinatorial representation of its topological structure. Clustering in this lower-dimensional topological space is more computationally efficient than clustering the raw data.

Principal Component Analysis (PCA) is a dimensionality reduction method that transforms the data into a new set of linear combinations of the original variables and drops the new variables with low variance. It preserves most of the information in the dataset while quickly removing noise and dimensions, which improves the speed and accuracy of the subsequent steps.
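
A minimal sketch, using scikit-learn on synthetic returns rather than the strategy's actual data, of how a fractional n_components keeps just enough components to explain the requested share of variance:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical log-return matrix: 200 stocks (rows) by 500 days (columns).
    rng = np.random.default_rng(1)
    log_returns = rng.normal(0, 0.01, size=(200, 500))

    # A float n_components keeps the fewest components explaining >= 80% of the variance.
    pca = PCA(n_components=0.8, random_state=1)
    reduced = pca.fit_transform(log_returns)
    print(reduced.shape, round(pca.explained_variance_ratio_.sum(), 2))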

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique that further projects high-dimensional data into a lower-dimensional space. It was chosen for its ability to preserve local and global data structures, which is essential for accurate clustering, while remaining computationally efficient.
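
A minimal sketch, with random data standing in for the PCA output, of projecting down to the single dimension that serves as the Mapper lens here:

    import numpy as np
    from umap import UMAP

    # Hypothetical PCA output: 200 stocks by 40 retained components.
    rng = np.random.default_rng(1)
    pca_output = rng.normal(size=(200, 40))

    # Project to one dimension while preserving neighborhood structure.
    embedding = UMAP(n_components=1, random_state=1).fit_transform(pca_output)
    print(embedding.shape)  # (200, 1)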

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that groups closely packed points and marks isolated points as noise, making it robust to outliers. DBSCAN with correlation distance is applied to identify clusters based on the similarity of stock returns so that we can spread the non-systematic risk evenly.
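
A minimal sketch, with two synthetic return factors rather than real stock data, of how correlation distance lets DBSCAN group series that move together:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two hypothetical factors, each driving five noisy "stocks".
    rng = np.random.default_rng(1)
    factor_a, factor_b = rng.normal(0, 0.01, size=(2, 250))
    returns = np.vstack([factor_a + rng.normal(0, 0.002, 250) for _ in range(5)] +
                        [factor_b + rng.normal(0, 0.002, 250) for _ in range(5)])

    # Correlation distance (1 - correlation) is small for series that co-move.
    labels = DBSCAN(eps=0.5, min_samples=2, metric='correlation').fit_predict(returns)
    print(labels)  # expected: one cluster per factor, e.g. [0 0 0 0 0 1 1 1 1 1]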

Highly correlated points projected by the Mapper Algorithm form small clusters, which are represented as nodes in a graph. Correlated small clusters are connected by edges, and each connected structure is considered a large cluster. The overall graph is called a simplicial complex, which represents the underlying topological structure of the data.


Since the clusters have low correlations with one another, we can split the capital equally between clusters, between subclusters within each cluster, and between stocks within each subcluster, spreading the capital risk evenly.

Implementation

To implement this strategy, we start by selecting SPY's top 200 weighted constituents in the initialize method with a TopologicalGraphUniverseSelectionModel class. 

    universe_model = TopologicalGraphUniverseSelectionModel(
        "SPY",
        history_lookback,
        recalibrate_period,
        lambda u: [x.symbol for x in sorted(
            [x for x in u if x.weight],
            key=lambda x: x.weight,
            reverse=True
        )[:200]]
    )

In the TopologicalGraphUniverseSelectionModel class, we use a KeplerMapper to project the log returns into a lower-dimensional space using PCA and UMAP. We then apply DBSCAN with correlation distance to cluster the projected data.

    # Compute daily log returns from historical prices (one row per symbol).
    prices = algorithm.history(self.universe.selected, lookback_window, Resolution.DAILY).unstack(0).close
    log_returns = np.log(prices / prices.shift(1)).dropna().T
    # Project the returns with PCA (keep 80% of the variance), then UMAP (down to one dimension).
    mapper = km.KeplerMapper()
    projected_data = mapper.fit_transform(log_returns, projection=[PCA(n_components=0.8, random_state=1), UMAP(n_components=1, random_state=1, n_jobs=-1)])
    # Build the Mapper graph, clustering each cover set with DBSCAN on correlation distance.
    graph = mapper.map(projected_data, log_returns, clusterer=DBSCAN(metric='correlation', n_jobs=-1))
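
KeplerMapper returns the graph as a dictionary: graph["nodes"] maps each node ID to the row indices of its members, and graph["links"] maps each node ID to its adjacent nodes. As a rough sketch of the analysis described below (not necessarily the model's exact code), the giant clusters can be recovered as connected components with networkx:

    import networkx as nx

    # Rebuild the Mapper graph: nodes are small clusters, edges join overlapping ones.
    nx_graph = nx.Graph()
    nx_graph.add_nodes_from(graph["nodes"])
    for source, targets in graph["links"].items():
        nx_graph.add_edges_from((source, target) for target in targets)

    # Each connected component is a giant cluster; each node within it is a small cluster.
    giant_clusters = [[list(graph["nodes"][node]) for node in component]
                      for component in nx.connected_components(nx_graph)]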

The resulting clusters are analyzed to identify giant clusters (connected simplicial complexes) and small clusters (grouped nodes). We then construct the portfolio by assigning equal weights to each giant cluster, each small cluster, and stocks within small clusters in the weight_distribution method of the EqualClustersWeightingPortfolioConstructionModel class.

    def weight_distribution(self, clustered_symbols):
        weights = {}

        def assign_weights(nested_list, level=1):
            # Split this level's weight equally among its members.
            num_elements = len(nested_list)
            weight_per_element = 1 / num_elements
            for item in nested_list:
                if isinstance(item, list):
                    assign_weights(item, level + 1)
                else:
                    # Leaves are discounted by depth; the result is renormalized below.
                    weights[item] = weights.get(item, 0) + weight_per_element / (2 ** (level - 1))

        assign_weights(clustered_symbols)
        return pd.Series(weights) / sum(weights.values())
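
For example, given one giant cluster holding two subclusters plus one standalone small cluster (the tickers are purely illustrative), the capital splits evenly at each level:

    # Hypothetical nested clusters produced by the graph analysis.
    clustered_symbols = [[["AAPL", "MSFT"], ["NVDA"]], ["JPM", "BAC"]]
    # weight_distribution assigns:
    #   giant cluster 50% -> subclusters 25% each -> AAPL and MSFT 12.5% each, NVDA 25%
    #   small cluster 50% -> JPM and BAC 25% each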

Results

The strategy was backtested from March 2020 to March 2025 on QuantConnect. The benchmarks were a buy-and-hold SPY position and a normalized portfolio of SPY's top 200 weighted constituents. The strategy yielded the following performance metrics over the backtest period.

Metric               TDA Portfolio (Proposed)   Normalized Top 200 Weighted SPY Constituents   Buy-and-Hold SPY
Sharpe Ratio         0.837                      0.817                                           0.807
Beta                 0.895                      0.926                                           1 (Reference)
Annualized Variance  0.02                       0.018                                           0.021
Maximum Drawdown     17.2%                      24.5%                                           26.3%

We ran a parameter optimization job to test the sensitivity of the chosen parameters. We tested a historical data lookback window of 50 to 350 weeks in steps of 50 weeks and a simplicial complex reconstruction period of 25 to 250 days in steps of 25 days. Of the 70 parameter combinations, 8 (11.4%) produced a higher Sharpe ratio than the benchmark, 65 (92.9%) produced a lower Beta, and all 70 (100%) produced a smaller maximum drawdown.

We chose a historical data lookback window of 150 weeks and a 125-day simplicial complex reconstruction period as the strategy's defaults because they produced the best risk-adjusted return while maintaining a low drawdown and a low correlation with the benchmark. This configuration yielded an Alpha of 0.018, demonstrating the potential of using TDA and clustering techniques to enhance risk-adjusted return while lowering the correlation with the benchmark.

References

  • Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society, 46(2), 255-308.
  • McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
  • Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 96, 226-231.

Author

Louis Szeto
