Datasets
Misconceptions
Introduction
Some reported data issues aren't actually data issues. Instead, they stem from a misunderstanding of how the data is collected, timestamped, formatted, and normalized. These misunderstandings are caused by assumptions that the data should be the same across different platforms, should share the same timezone, should be timestamped a certain way, and should be normalized the same way as other data sources.
Cross-Platform Discrepancies
You may find that our data is sometimes slightly different from the data that's displayed on other platforms. Most of the differences occur because our data is institutional quality while many other platforms use a cheaper alternative. We use the Consolidated Tape Association (CTA) and Unlisted Trading Privileges (UTP) tick feeds, which cover the entire US tick feed. In contrast, most charting websites use the Better Alternative Trading System (BATS), which has very permissive display policies but only covers about 6-7% of the total market volume. Our tick feed doesn't include over-the-counter (OTC) trades, but the data on other platforms like Yahoo Finance includes OTC trades.
Timezone Differences
Datasets all have different timezones. Most price datasets are timestamped in Eastern Time (ET). However, Future markets have more exotic timezones, depending on where the Future contract is trading. QuantConnect allows the raw data to be in different timezones. For US Equities, the timezone is ET. For Forex prices, the timezone is Coordinated Universal Time (UTC). For CFDs, the timezone depends on the market where each CFD product is listed. In contrast, other charting platforms may display data with ET timestamps. QuantConnect accurately reflects the timezone of the market where each asset trades.
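Before you compare a timestamp from another platform against ours, convert both to the same timezone. The following is a minimal sketch that uses only Python's standard zoneinfo module. The helper name to_eastern and the sample timestamp are illustrative assumptions and are not part of the LEAN API.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")
EASTERN = ZoneInfo("America/New_York")

def to_eastern(timestamp: datetime, source_tz: ZoneInfo) -> datetime:
    """Interpret a naive timestamp in source_tz and convert it to Eastern Time."""
    return timestamp.replace(tzinfo=source_tz).astimezone(EASTERN)

# Hypothetical Forex bar stamped 14:30 UTC on a winter day (EST is UTC-5).
forex_bar_time = datetime(2024, 1, 16, 14, 30)
print(to_eastern(forex_bar_time, UTC))  # 2024-01-16 09:30:00-05:00
```

Because the conversion uses a named timezone instead of a fixed offset, the same call returns 10:30 ET for a summer date, when daylight saving time is in effect.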
Misaligned Timestamps
Every piece of data has a period. Some data is near-instantaneous, like tick data. Other data has a longer period, like second, minute, hour, and daily bars. QuantConnect delivers this data to your algorithms at the end of the period to ensure that lookahead bias doesn’t occur. When you look at the Time (C#) / time (Python) property of your algorithm, the period has already ended, so it looks as if the data is offset by one period. To compare the timestamps of our data to other data, use the Time/time property of the current bar. The Time/time property of the bar is the start of the bar and the EndTime/end_time property is the end of the bar. If you use Python and request historical data, the time index in the DataFrame that's returned maps to the EndTime/end_time of the respective bar. For more information about timestamps, see Time Modeling.
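The relationship between the start and end of a bar can be sketched in a few lines of Python. The Bar class below is a simplified stand-in written for this example, not the LEAN TradeBar class, and start_time_from_end is a hypothetical helper for shifting end-of-bar timestamps (such as a history index) back to the bar start before you compare them with a platform that stamps bars at their start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Bar:
    """Simplified stand-in for a price bar; not the LEAN TradeBar class."""
    time: datetime      # start of the bar
    period: timedelta   # length of the bar
    close: float

    @property
    def end_time(self) -> datetime:
        """End of the bar, which is when the bar is delivered."""
        return self.time + self.period

def start_time_from_end(end_time: datetime, period: timedelta) -> datetime:
    """Convert an end-of-bar timestamp to the start of that bar."""
    return end_time - period

bar = Bar(time=datetime(2024, 1, 16, 9, 30), period=timedelta(minutes=1), close=185.20)
print(bar.time)      # 2024-01-16 09:30:00
print(bar.end_time)  # 2024-01-16 09:31:00
print(start_time_from_end(bar.end_time, bar.period) == bar.time)  # True
```

In this sketch, the minute bar that covers 9:30 to 9:31 only becomes available at 9:31, which is why it can look offset by one period.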
Data Normalization
The data normalization mode defines how historical data is adjusted to account for splits, dividends, and continuous Future contract rollovers. When you compare the data in the Dataset Market to data that's hosted on other platforms, the data may have different values because a different data normalization mode is being used to adjust the data. Ensure both datasets use the same normalization mode before reporting data issues. The most common way to recognize this mismatch is to compare the two price series and see them deviate significantly in the past. The following data normalization modes are available:
Adjusted Prices
By default, LEAN adjusts US Equity data for splits and dividends to produce a smooth price curve. We use the entire split and dividend history to adjust historical prices. This process ensures you get the same adjusted prices, regardless of the backtest end date.
Backtest differences occur when you run a backtest before a split or dividend occurs in live trading and then run the same backtest after it occurs. The second time you run the backtest, the adjusted prices are different, which can cause different backtest results. The difference can be significant in large universes because of multiple corporate actions and the cumulative effect of orders with small differences.
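To see why a new corporate action changes historical adjusted prices, consider a simplified back-adjustment for splits only. This is an illustrative sketch, not LEAN's actual implementation. The function name adjust_for_splits, the integer day labels, and the sample prices are assumptions made for the example.

```python
def adjust_for_splits(raw_prices, splits):
    """Back-adjust raw prices for splits.

    raw_prices: list of (day, price) tuples in chronological order.
    splits: dict mapping the day a split takes effect to its factor,
            e.g. 0.5 for a 2-for-1 split.
    Every price before a split day is multiplied by that split's factor.
    """
    adjusted = []
    for day, price in raw_prices:
        factor = 1.0
        for split_day, split_factor in splits.items():
            if day < split_day:
                factor *= split_factor
        adjusted.append((day, price * factor))
    return adjusted

# Hypothetical raw prices; a 2-for-1 split takes effect on day 3.
raw = [(1, 100.0), (2, 102.0), (3, 51.5)]

# Backtest run before the split exists: no adjustment.
print(adjust_for_splits(raw[:2], {}))    # [(1, 100.0), (2, 102.0)]
# Same backtest re-run after the split: earlier prices are halved.
print(adjust_for_splits(raw, {3: 0.5}))  # [(1, 50.0), (2, 51.0), (3, 51.5)]
```

The prices for days 1 and 2 differ between the two runs even though the raw data for those days never changed, so any order sized from price can differ too.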
Opening and Closing Auctions
The opening and closing price of the day is set by very specific opening and closing auction ticks. When a stock like Apple is listed, it’s listed on Nasdaq. The open auction tick on Nasdaq is the price that’s used as the official open of the day. NYSE, BATS, and other exchanges also have opening auctions, but the only official opening price for Apple is the opening auction on the exchange where it was listed.
We set the opening and closing prices of the first and last bars of the day to the official auction prices. This process is used for second, minute, hour, and daily bars for the 9:30 AM and 4:00 PM Eastern Time (ET) prices. In contrast, other platforms might not be using the correct opening and closing prices.
The official auction prices are usually emitted 2-30 seconds after the market open and close. We do our best to use the official opening and closing prices in the bars we build, but the delay can be so large that there isn't enough time to update the opening and closing price of the bar before it's injected into your algorithms. For example, if you subscribe to second resolution data, we wait until the end of the second for the opening price but most second resolution data won’t get the official opening price. If you subscribe to minute resolution data, we wait until the end of the minute for the opening auction price. Most of the time, you’ll get the actual opening auction price with minute resolution data, but there are always exceptions. Nasdaq and NYSE can have delays in publishing the opening auction price, but we don’t have control over those issues and we have to emit the data on time so that you get the bar you are expecting.
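The effect of the auction delay on different resolutions can be illustrated with a small sketch. The opening_price function, its fallback to the first trade price, and the 12-second delay are assumptions made for this example. They are not a description of how LEAN builds bars internally.

```python
from datetime import datetime, timedelta

def opening_price(first_trade_price, auction_price, auction_time, bar_end):
    """Return the open for the first bar of the day.

    Use the official auction price only if it arrived before the bar
    had to be emitted; otherwise fall back to the first trade price.
    """
    if auction_time is not None and auction_time <= bar_end:
        return auction_price
    return first_trade_price

market_open = datetime(2024, 1, 16, 9, 30)
auction_time = market_open + timedelta(seconds=12)  # hypothetical 12-second delay

second_bar_end = market_open + timedelta(seconds=1)
minute_bar_end = market_open + timedelta(minutes=1)

# The second bar is emitted before the auction price arrives.
print(opening_price(185.10, 185.25, auction_time, second_bar_end))  # 185.1
# The minute bar is emitted after the auction price arrives.
print(opening_price(185.10, 185.25, auction_time, minute_bar_end))  # 185.25
```

With these sample values, the first second bar opens at the first trade price while the first minute bar opens at the official auction price, so the two resolutions report different opens for the same day.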
Live and Backtesting Differences
In live trading, bars are built using the exchange timestamps with microsecond accuracy. This microsecond-by-microsecond processing of the ticks can mean that the individual bars between live trading and backtesting can have slightly different ticks. As a result, it's possible for a tick to be counted in different bars between backtesting and live trading, which can lead to bars having slightly different open, high, low, close, and volume values.
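The following sketch shows how a tiny timestamp difference near a bar boundary can move a tick into a different bar. The tick values and the volume_by_minute helper are hypothetical. The example assumes the same three trades appear in both feeds and that one timestamp differs by two microseconds.

```python
from collections import defaultdict
from datetime import datetime

def volume_by_minute(ticks):
    """Sum tick volume per minute bar, keyed by the bar's start time."""
    bars = defaultdict(int)
    for timestamp, volume in ticks:
        bars[timestamp.replace(second=0, microsecond=0)] += volume
    return dict(bars)

# Same trades in both feeds; only the second timestamp differs (hypothetical).
live_ticks = [
    (datetime(2024, 1, 16, 9, 30, 15), 100),
    (datetime(2024, 1, 16, 9, 30, 59, 999999), 200),
    (datetime(2024, 1, 16, 9, 31, 20), 300),
]
backtest_ticks = [
    (datetime(2024, 1, 16, 9, 30, 15), 100),
    (datetime(2024, 1, 16, 9, 31, 0, 1), 200),
    (datetime(2024, 1, 16, 9, 31, 20), 300),
]

print(list(volume_by_minute(live_ticks).values()))      # [300, 300]
print(list(volume_by_minute(backtest_ticks).values()))  # [100, 500]
```

The total volume is identical in both cases, but the per-bar volume differs, and the same shift can change the open, high, low, and close of the affected bars.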
There is a delay in when new live data is available for backtesting. It's normally available after 24-48 hours. If you need to closely monitor new data, use live paper trading.