Data
Data is the most important part of a backtest. A good strategy can look bad on broken data, and a bad strategy can look good if future information leaks into the past.
Mental Model
FINSABER expects data to be organized by date:

    date
      price
        ticker -> OHLCV bar
      news
        ticker -> list of articles
      filing_k
        ticker -> annual filing text
      filing_q
        ticker -> quarterly filing text
      optional extra modality
        ticker -> payload

The engine asks the loader: "What could the strategy know on this date?" The strategy should not query future dates directly.
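
The question-and-answer flow above can be sketched as a loop in which the engine, not the strategy, advances the clock. This is only an illustration: `DictLoader`, `run_backtest`, and the `on_date` callback are hypothetical names rather than FINSABER's engine API, and the second day's price is an illustrative value.

```python
from datetime import date


class DictLoader:
    """Hypothetical stand-in for a loader: wraps a {date: snapshot} dict."""

    def __init__(self, data):
        self._data = data

    def get_data_by_date(self, day):
        # Return only what is stored under this exact date.
        return self._data.get(day, {})


def run_backtest(loader, trading_days, on_date):
    """Walk forward one date at a time and hand the strategy that day's snapshot."""
    for day in sorted(trading_days):
        snapshot = loader.get_data_by_date(day)
        on_date(day, snapshot)  # the strategy never receives the loader itself


data = {
    date(2024, 1, 2): {"price": {"AAPL": {"close": 185.64}}},
    date(2024, 1, 3): {"price": {"AAPL": {"close": 184.25}}},  # illustrative value
}
seen = []
run_backtest(DictLoader(data), data.keys(), lambda day, snap: seen.append((day, snap)))
```

Because the callback only ever sees one snapshot, it has no handle through which to request a later date.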
Data Interface
All loaders should implement TradingData. Engines depend on this interface instead of a specific storage format, so users can plug in local Parquet files, database-backed loaders, or enriched datasets with extra modalities.
Required methods:
- get_data_by_date(date)
- get_ticker_price_by_date(ticker, date, price_field=None)
- get_ticker_data_by_date(ticker, date)
- get_tickers_list()
- get_subset_by_time_range(start_date, end_date)
- get_ticker_subset_by_time_range(ticker, start_date, end_date)
- get_date_range()
The contract is date-first. A loader returns all available modalities for one date, and each modality is keyed by ticker where applicable. This keeps strategies from accidentally scanning future rows.
Price Field Requirements
The minimum useful daily bar is:
| Field | Required | Use |
|---|---|---|
| open | Yes | Raw open. Used to derive adjusted open. |
| high | Yes | Raw high. Used to derive adjusted high. |
| low | Yes | Raw low. Used to derive adjusted low. |
| close | Yes | Raw close. Used with adjusted close to compute adjustment factor. |
| adjusted_close | Strongly recommended | Split/dividend-adjusted close. |
| volume | Strongly recommended | Raw share volume for liquidity caps. |
| cik | Optional | Useful for SEC filing alignment. |
If adjusted_close is missing, raw prices may create false jumps around stock splits. For serious historical equity tests, prefer adjusted OHLC for price simulation and raw volume for liquidity.
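
To see why raw prices mislead around a split, the following sketch compares close-to-close returns on raw and adjusted series. All prices are illustrative numbers for a hypothetical 4-for-1 split taking effect on the third day; dividends are ignored.

```python
def daily_returns(prices):
    """Simple close-to-close returns for a list of prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


# Illustrative numbers: a 4-for-1 split takes effect on the third day.
raw_close = [400.0, 404.0, 101.5]
# Adjusted close rescales the pre-split prices by the split ratio.
adjusted_close = [100.0, 101.0, 101.5]

raw = daily_returns(raw_close)       # second return is a false crash of about -75%
adj = daily_returns(adjusted_close)  # second return is a small, genuine move
```

A strategy fed the raw series would register a crash that no shareholder actually experienced.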
Parquet Layout
FinsaberParquetDataset expects:

    sp500_2000_2025_parquet/
      price_daily/year=YYYY/part-000.parquet
      news_items/year=YYYY/part-000.parquet
      filingk/year=YYYY/part-000.parquet
      filingq/year=YYYY/part-000.parquet

Price rows should include:
The loader computes:

    adjusted_open = open * adjusted_close / close
    adjusted_high = high * adjusted_close / close
    adjusted_low = low * adjusted_close / close

Raw volume is retained for liquidity caps.
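
The derivation above can be written as a small helper. This is a sketch, not the loader's actual code: `add_adjusted_ohl` is a hypothetical name, and leaving the bar untouched when `adjusted_close` is missing or `close` is zero is an assumption about how to handle those cases.

```python
def add_adjusted_ohl(bar):
    """Return a copy of bar with adjusted_open/high/low derived from the close ratio."""
    out = dict(bar)
    close = bar.get("close")
    adjusted_close = bar.get("adjusted_close")
    if not close or adjusted_close is None:
        # Assumption: without a usable ratio, leave the bar as-is rather than guess.
        return out
    for field in ("open", "high", "low"):
        out[f"adjusted_{field}"] = bar[field] * adjusted_close / close
    return out


bar = {
    "open": 187.15,
    "high": 188.44,
    "low": 183.89,
    "close": 185.64,
    "adjusted_close": 185.30,
    "volume": 82488700,
}
adjusted = add_adjusted_ohl(bar)
```

Volume is copied through unchanged, matching the note that raw volume is retained.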
The loader lazily reads the selected date range and ticker universe. When tickers="all", the ticker list is inferred from price_daily. The default price field is adjusted_close, but execution can use adjusted_open, adjusted_high, and adjusted_low when timing requires them.
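
One way to picture the lazy read is to map a requested date range onto the year partitions it touches. The helper below is a hypothetical sketch, not part of FinsaberParquetDataset; it assumes one `part-000.parquet` file per year, as in the layout shown above.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_paths(root, table, start, end):
    """List one part file per calendar year touched by [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    return [
        PurePosixPath(root) / table / f"year={year}" / "part-000.parquet"
        for year in range(start.year, end.year + 1)
    ]


paths = partition_paths(
    "sp500_2000_2025_parquet", "price_daily", date(2019, 11, 1), date(2021, 2, 15)
)
```

Rows in the first and last partitions would still need to be filtered to the exact start and end dates after reading.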
In-Memory Format
FinsaberDataset accepts:

    {
        date: {
            "price": {
                "AAPL": {
                    "open": 187.15,
                    "high": 188.44,
                    "low": 183.89,
                    "close": 185.64,
                    "adjusted_close": 185.30,
                    "volume": 82488700,
                }
            },
            "news": {"AAPL": ["..."]},
            "filing_k": {"AAPL": "..."},
            "filing_q": {"AAPL": "..."},
        }
    }

You may add extra modalities such as:
- earnings_call
- analyst_report
- macro
- alternative_data
Keep the daily shape consistent: modality name, ticker key, payload.
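
A quick shape check can catch a malformed modality before it reaches a strategy. The function below is a hypothetical sketch and not part of FinsaberDataset. The `market_wide` parameter is an assumption for modalities, such as macro data, that may not be keyed by ticker.

```python
def check_daily_shape(snapshot, market_wide=frozenset()):
    """Return a list of problems in one date's {modality: {ticker: payload}} dict."""
    problems = []
    for modality, by_ticker in snapshot.items():
        if not isinstance(modality, str):
            problems.append(f"modality name {modality!r} is not a string")
            continue
        if modality in market_wide:
            continue  # assumed exemption: payload is not keyed by ticker
        if not isinstance(by_ticker, dict):
            problems.append(f"{modality}: expected a dict keyed by ticker")
            continue
        for ticker in by_ticker:
            if not isinstance(ticker, str):
                problems.append(f"{modality}: ticker key {ticker!r} is not a string")
    return problems


good = {"price": {"AAPL": {"close": 185.64}}, "earnings_call": {"AAPL": "..."}}
bad = {"news": ["..."]}  # a bare list, missing the ticker level
```

Here `check_daily_shape(good)` returns an empty list, while `check_daily_shape(bad)` reports that `news` is not keyed by ticker.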
Alignment Policy
Price, news, and filings do not always align perfectly. The loader should preserve what is known and should not fill gaps by guessing.
| Situation | Recommended behavior |
|---|---|
| Price exists, no news | Return an empty or missing news entry for that ticker/date. |
| News exists, no price | Keep it for feature construction only if the strategy can handle a missing tradable price. |
| Filing appears under ticker alias | Deduplicate by accession or CIK before feature construction. |
| Date-only text timestamp | Treat as available from the next decision point unless timestamps prove otherwise. |
| Delisted ticker has no future price | Let the engine skip or reject fills with missing future bars. |
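
The alias row in the table can be illustrated with a small deduplication pass. This is a sketch under assumptions: the `accession`, `cik`, `ticker`, and `text` keys are hypothetical field names, and the identifiers in the example are placeholders rather than real SEC values.

```python
def dedupe_filings(filings):
    """Keep the first filing seen for each accession, preserving input order."""
    seen = set()
    unique = []
    for filing in filings:
        # Fall back to (cik, text) when no accession identifier is present.
        key = filing.get("accession") or (filing.get("cik"), filing.get("text"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(filing)
    return unique


filings = [
    {"ticker": "GOOGL", "cik": "CIK-A", "accession": "ACC-1", "text": "..."},
    {"ticker": "GOOG", "cik": "CIK-A", "accession": "ACC-1", "text": "..."},  # alias duplicate
    {"ticker": "GOOGL", "cik": "CIK-A", "accession": "ACC-2", "text": "..."},
]
unique = dedupe_filings(filings)
```

Which alias survives depends on input order, so sort the input first if one ticker should be preferred.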
Implementing A Custom Loader

    from finsaber import TradingData


    class MyDataset(TradingData):
        def __init__(self, connection, tickers):
            self.connection = connection
            self.tickers = tickers

        def get_data_by_date(self, date):
            # Return every modality available for this one date, keyed by ticker.
            # _load_prices and _load_calls are user-supplied helpers (not shown).
            return {
                "price": self._load_prices(date),
                "earnings_call": self._load_calls(date),
            }

        def get_tickers_list(self):
            return list(self.tickers)

Implement the remaining abstract methods by filtering the same underlying source. If a modality is unavailable for a date, return an empty dictionary rather than leaking data from another date.
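
As a rough guide to the remaining methods, here is a standalone sketch over a plain dictionary. It deliberately does not import finsaber so that it runs on its own; a real loader would subclass TradingData. The return types are assumptions (a sorted list for `get_date_range`, `None` for a missing price), the class name `InMemorySketch` is hypothetical, and the second day's prices are illustrative.

```python
import datetime as dt


class InMemorySketch:
    """Seven-method contract over {date: {modality: {ticker: payload}}}."""

    def __init__(self, data, default_price_field="adjusted_close"):
        self._data = dict(data)
        self._default_price_field = default_price_field

    def get_data_by_date(self, date):
        # A missing date yields an empty dict, never a neighbouring date's data.
        return self._data.get(date, {})

    def get_ticker_data_by_date(self, ticker, date):
        day = self.get_data_by_date(date)
        return {m: t[ticker] for m, t in day.items()
                if isinstance(t, dict) and ticker in t}

    def get_ticker_price_by_date(self, ticker, date, price_field=None):
        bar = self.get_data_by_date(date).get("price", {}).get(ticker)
        if bar is None:
            return None
        return bar.get(price_field or self._default_price_field)

    def get_tickers_list(self):
        tickers = set()
        for day in self._data.values():
            tickers.update(day.get("price", {}))
        return sorted(tickers)

    def get_date_range(self):
        return sorted(self._data)

    def get_subset_by_time_range(self, start_date, end_date):
        # The subset holds only in-range dates, so it cannot see later ones.
        kept = {d: v for d, v in self._data.items() if start_date <= d <= end_date}
        return InMemorySketch(kept, self._default_price_field)

    def get_ticker_subset_by_time_range(self, ticker, start_date, end_date):
        subset = self.get_subset_by_time_range(start_date, end_date)
        kept = {
            d: {m: {ticker: t[ticker]} for m, t in day.items()
                if isinstance(t, dict) and ticker in t}
            for d, day in subset._data.items()
        }
        return InMemorySketch(kept, self._default_price_field)


d1, d2 = dt.date(2024, 1, 2), dt.date(2024, 1, 3)
ds = InMemorySketch({
    d1: {"price": {"AAPL": {"close": 185.64, "adjusted_close": 185.30}},
         "news": {"AAPL": ["..."]}},
    d2: {"price": {"AAPL": {"close": 184.30, "adjusted_close": 184.00}}},  # illustrative
})
first_day_only = ds.get_subset_by_time_range(d1, d1)
```

The subset is built from a filtered copy of the data, so `first_day_only` has no path back to the second day.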
Custom Dataset Checklist
- Implement all TradingData methods.
- Return Python date keys or normalize date strings consistently.
- Keep ticker symbols consistent across price, news, and filings.
- Include adjusted_close when possible.
- Keep raw volume unadjusted for liquidity calculations.
- Make get_subset_by_time_range and get_ticker_subset_by_time_range return a smaller loader, not a future-aware object.
- Add tests with a tiny in-memory dataset before using a full private dataset.
Avoiding Look-Ahead Bias
Date-only news and filings should be considered unavailable until the next decision point unless timestamps prove same-day availability. For strict backtests, prefer execution_timing="next_open".
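
The next-decision-point rule can be sketched with a sorted list of decision dates. The function name `available_from` and the `same_day_proven` flag are hypothetical, and the sketch assumes one decision point per trading day.

```python
import bisect
import datetime as dt


def available_from(item_date, decision_dates, same_day_proven=False):
    """Return the first decision date on which an item may be used, or None.

    decision_dates must be sorted ascending. A date-only item becomes visible on
    the first decision date strictly after item_date. An item whose timestamp
    proves same-day availability is allowed on item_date itself.
    """
    if same_day_proven:
        idx = bisect.bisect_left(decision_dates, item_date)
    else:
        idx = bisect.bisect_right(decision_dates, item_date)
    return decision_dates[idx] if idx < len(decision_dates) else None


friday, monday = dt.date(2024, 1, 5), dt.date(2024, 1, 8)
decision_dates = [friday, monday]

date_only = available_from(friday, decision_dates)                       # Monday
weekend = available_from(dt.date(2024, 1, 6), decision_dates)            # Monday
proven = available_from(friday, decision_dates, same_day_proven=True)    # Friday
```

An item dated on or after the last decision date returns `None`, which means it never becomes usable within the backtest window.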