Introduction

This research diversifies SPY by clustering its top 200 weighted constituents using topological data analysis (TDA). The strategy aims to minimize correlation risk by employing KeplerMapper for projection, Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for clustering. The resulting clusters are then used to construct a portfolio with equal weighting across giant and small clusters and within small clusters. This method is expected to enhance portfolio diversification, reduce the portfolio's correlation with SPY, and reduce its drawdown.

Background

Topological Data Analysis (TDA) is a field of data analysis that uses topological techniques to understand the shape and structure of data. TDA can reveal hidden patterns and relationships in high-dimensional data, allowing us to cluster securities whose correlations are not obvious in the original, non-linear, high-dimensional space.

The Mapper Algorithm is a tool used in TDA to project high-dimensional data into a lower-dimensional space while preserving a combinatorial representation of its topological structure. Clustering in this topological representation is more computationally efficient.
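As a rough illustration, the following minimal sketch runs Mapper on synthetic data (not the strategy's pipeline); it assumes the kmapper and scikit-learn packages are installed, and the lens, cover, and clusterer choices here are arbitrary:

    import numpy as np
    import kmapper as km
    from sklearn.cluster import DBSCAN

    data = np.random.rand(500, 3)                      # toy point cloud in 3 dimensions
    mapper = km.KeplerMapper()
    lens = mapper.fit_transform(data, projection=[0])  # lens: project onto the first coordinate
    graph = mapper.map(lens, data, clusterer=DBSCAN(eps=0.3, min_samples=5))
    print(len(graph['nodes']), 'nodes in the Mapper graph')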

Principal Component Analysis (PCA) is a dimension reduction method that transforms the data into a new set of linear combinations of the original variables and drops the components with low variance. It preserves most of the information in the dataset while quickly removing noise and dimensions, improving the speed and accuracy of the subsequent steps.
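For instance, the following minimal sketch (on toy random returns; the 0.8 threshold mirrors the n_components=0.8 setting used in the implementation below) keeps just enough components to explain 80% of the total variance:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    returns = rng.normal(0, 0.01, size=(200, 250))  # toy: 200 assets x 250 days
    pca = PCA(n_components=0.8, random_state=1)     # keep 80% of the total variance
    reduced = pca.fit_transform(returns)
    print(reduced.shape[1], pca.explained_variance_ratio_.sum())  # components kept, variance retained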

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique that further projects high-dimensional data into a lower-dimensional space. It was chosen for its ability to preserve local and global data structures, which is essential for accurate clustering, while remaining computationally efficient.
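A minimal sketch of this step, assuming the umap-learn package and toy inputs standing in for the PCA output:

    import numpy as np
    from umap import UMAP

    rng = np.random.default_rng(1)
    features = rng.random((200, 40))             # e.g., PCA-reduced features for 200 assets
    reducer = UMAP(n_components=1, random_state=1)
    embedding = reducer.fit_transform(features)  # shape (200, 1): a 1-D embedding preserving neighborhoods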

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that groups densely packed points and labels isolated points as noise, making it robust to outliers. We apply DBSCAN with correlation distance to identify clusters based on the similarity of stock returns, so that non-systematic risk can be spread evenly across them.
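The following minimal sketch clusters toy factor-driven returns with correlation distance; the eps and min_samples values are illustrative (scikit-learn's defaults), not tuned:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(1)
    factors = rng.normal(0, 0.01, size=(4, 250))  # 4 latent return drivers
    loadings = rng.integers(0, 4, size=200)       # each toy asset follows one driver
    log_returns = factors[loadings] + rng.normal(0, 0.002, size=(200, 250))

    db = DBSCAN(eps=0.5, min_samples=5, metric='correlation').fit(log_returns)
    print(set(db.labels_))  # cluster ids; -1 labels assets treated as noise/outliers

Assets whose correlation distance (one minus correlation) is small end up in the same cluster, while uncorrelated assets are separated or flagged as noise.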

Highly correlated points projected by the Mapper Algorithm form small clusters, which can be represented as nodes in a graph. Correlated small clusters are connected by edges, and each connected structure is considered a giant cluster. The overall graph is called a simplicial complex, which represents the underlying topological structure of the data.

[Figure: Mapper graph of the constituents, with small clusters as nodes and giant clusters as connected components]

Since the clusters and subclusters have low correlation with one another, we can split the capital equally between clusters and subclusters, and within subclusters, to spread the capital risk evenly.

Implementation

To implement this strategy, we start by selecting SPY's top 200 weighted constituents in the initialize method using the TopologicalGraphUniverseSelectionModel class.

    # Select the 200 constituents with the largest SPY weights
    universe_model = TopologicalGraphUniverseSelectionModel(
        "SPY",
        history_lookback,
        recalibrate_period,
        lambda u: [x.symbol for x in sorted(
            [x for x in u if x.weight],
            key=lambda x: x.weight,
            reverse=True
        )[:200]]
    )

In the TopologicalGraphUniverseSelectionModel class, we use a KeplerMapper to project the log returns into a lower-dimensional space using PCA and UMAP. We then apply DBSCAN to cluster the projected data using correlation distance.

    import numpy as np
    import kmapper as km
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import PCA
    from umap import UMAP

    prices = algorithm.history(self.universe.selected, lookback_window, Resolution.DAILY).unstack(0).close
    log_returns = np.log(prices / prices.shift(1)).dropna().T  # one row per asset
    mapper = km.KeplerMapper()
    projected_data = mapper.fit_transform(log_returns, projection=[PCA(n_components=0.8, random_state=1), UMAP(n_components=1, random_state=1, n_jobs=-1)])
    graph = mapper.map(projected_data, log_returns, clusterer=DBSCAN(metric='correlation', n_jobs=-1))
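The graph returned by mapper.map is a dictionary whose 'nodes' entry maps each node id to the rows of log_returns it contains and whose 'links' entry records which nodes overlap. As a minimal sketch of how giant clusters could then be recovered as connected components (networkx and the clustered_symbols construction are illustrative choices here, not necessarily the model's exact code):

    import networkx as nx

    g = nx.Graph()
    g.add_nodes_from(graph['nodes'])
    g.add_edges_from((a, b) for a, targets in graph['links'].items() for b in targets)

    # Each connected component is one giant cluster; its nodes are small clusters
    clustered_symbols = [
        [[log_returns.index[i] for i in graph['nodes'][node]] for node in component]
        for component in nx.connected_components(g)
    ]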

The resulting clusters are analyzed in this way to identify giant clusters (connected simplicial complexes) and small clusters (individual nodes). We then construct the portfolio by assigning equal weights to each giant cluster, each small cluster, and the stocks within small clusters in the weight_distribution method of the EqualClustersWeightingPortfolioConstructionModel class.

    def weight_distribution(self, clustered_symbols):
        weights = {}

        def assign_weights(nested_list, level=1):
            # Split this level's allocation equally among its members
            num_elements = len(nested_list)
            weight_per_element = 1 / num_elements
            for item in nested_list:
                if isinstance(item, list):
                    assign_weights(item, level + 1)  # recurse into subclusters
                else:
                    # Halve the contribution for each extra level of nesting
                    weights[item] = weights.get(item, 0) + weight_per_element / (2 ** (level - 1))

        assign_weights(clustered_symbols)
        # Normalize so the final weights sum to one
        return pd.Series(weights) / sum(weights.values())
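Calling weight_distribution with a hypothetical nesting, one giant cluster holding two small clusters plus one standalone small cluster (the ticker names are placeholders), illustrates the halving-by-level scheme:

    clustered_symbols = [[['A', 'B'], ['C', 'D', 'E']], ['F', 'G']]
    # weight_distribution(clustered_symbols) returns, after normalization:
    #   A, B:    0.125 each    (first subcluster of the giant cluster)
    #   C, D, E: ~0.0833 each  (second subcluster of the giant cluster)
    #   F, G:    0.25 each     (standalone small cluster)

Each subcluster of the giant cluster receives a quarter of the capital, while the standalone small cluster receives half, so deeper nesting dilutes individual positions.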

Results

The strategy was backtested from March 2020 to March 2025 using the LEAN engine. The benchmarks were a buy-and-hold SPY position and a normalized top 200 weighted SPY constituents strategy. The strategy yielded the following performance metrics over the backtest period.

    Metric                 TDA Portfolio (Proposed)   Normalized Top 200 Weighted SPY Constituents   Buy-and-Hold SPY
    Beta                   0.803                      0.926                                          1 (Reference)
    Annualized Variance    0.016                      0.018                                          0.021
    Maximum Drawdown       17.4%                      24.5%                                          26.3%

We ran a parameter optimization job to test the sensitivity of the chosen parameters. We tested historical data lookback windows from 50 to 500 weeks in steps of 50 weeks and a simplicial complex reconstruction period of daily (0), weekly (1), monthly (2), or yearly (3). Of the 40 parameter combinations, 40/40 (100%) produced a lower Beta than the benchmark, and 32/40 (80.0%) produced a smaller maximum drawdown than the benchmark.

[Figure: Parameter optimization results across lookback windows and reconstruction periods; the red circle marks the chosen default parameters]

The red circle in the preceding figure identifies the parameters we chose as the strategy's defaults. We chose a historical data lookback window of 200 weeks and yearly simplicial complex reconstruction because they produced the lowest drawdown and the lowest correlation with the benchmark.

The strategy also yielded an Alpha of 0.012, demonstrating the potential of TDA and clustering techniques to enhance risk-adjusted returns while lowering correlation with the benchmark.

References

  • Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society, 46(2), 255-308.
  • McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
  • Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 96, 226-231.

Author

Louis Szeto