# S&P 500 Simulator with Cellular Automata

This won't make you the next Jim Simons, but it's a fun way to learn about the stock market and cellular automata.

**The goal of this project is not to create a perfect stock market simulator or a complex cellular automata model.**
It is to explore the idea of using cellular automata and neural networks together to learn about the intersection
of these two fields. The predictions made by the neural networks are not accurate and should not be used for
investment decisions.

**Read the Notebook for a detailed explanation of the project: SP500-Cellular-Automata.ipynb**

## What's special?

**Cellular Automata meets Neural Networks**

- Grid placement is calculated analytically by using over 30 years of historical data of the S&P 500
- The grid weights (state transitions) are LSTM models trained on real historical data

### Demo

## Usage

### Requirements

### Installation

### Running the Simulator

You can also simply pass the arguments without `--ticker` and `--change` as long as they are passed in order.
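
As a rough illustration of how a script could accept both the flagged and the ordered form, here is a minimal `argparse` sketch. The `parse_args` helper and the `pos_ticker` / `pos_change` names are hypothetical and not taken from the project; only the `--ticker` and `--change` flags and the AAPL / 0.245 defaults come from this README.

```python
import argparse


def parse_args(argv=None):
    """Accept ticker and change either as flags or as ordered positionals."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    # Optional positionals (hypothetical names): filled in order when given.
    parser.add_argument("pos_ticker", nargs="?", default=None)
    parser.add_argument("pos_change", nargs="?", type=float, default=None)
    # Named flags, as documented in this README.
    parser.add_argument("--ticker", default=None)
    parser.add_argument("--change", type=float, default=None)
    ns = parser.parse_args(argv)

    # Flags take precedence over positionals; otherwise use the defaults.
    ticker = ns.ticker or ns.pos_ticker or "AAPL"
    if ns.change is not None:
        change = ns.change
    elif ns.pos_change is not None:
        change = ns.pos_change
    else:
        change = 0.245
    return ticker, change
```

Because the parser defines no options that look like negative numbers, `argparse` treats a value such as `-0.147` as an argument rather than a flag, so negative changes work in both forms.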

To run with default arguments (AAPL and 0.245)

To run with custom arguments (MSFT and -0.147)

This runs the simulator using the previous 4 days of real data, sets the 5th-day change for the given ticker to the supplied value, and then simulates the performance of the other stocks.
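
The five-day window described above could be assembled roughly as follows. This is only a sketch of the idea, not the simulator's actual code: `build_window` is a hypothetical helper, and the sample values in the usage note are made up.

```python
import numpy as np


def build_window(trailing_changes, fifth_day_change):
    """Combine 4 real daily changes with a user-supplied 5th-day change."""
    trailing = np.asarray(trailing_changes, dtype=float)
    if trailing.shape != (4,):
        raise ValueError("expected exactly 4 trailing daily changes")
    # Append the user-set change as day 5 and shape it as a column vector.
    window = np.append(trailing, float(fifth_day_change))
    return window.reshape(5, 1)
```

For example, `build_window([0.5, -0.2, 0.1, 0.3], 0.245)` returns a `(5, 1)` array whose last entry is the value passed via `--change`.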

NOTE: Running the simulator for the first time can take 10-30s to set up. Each simulation takes 5-15s.

### Downloading Latest Data

Update the `start_date` and `end_date` in `data.py` to the desired date range.
The default is *1st Jan 1990* to *31st Dec 2022*.

**4 Million Data Points** over 33 years for 503 stocks in the S&P 500.

Then run the following command:

This will perform the following steps:

- Gets the latest list of stocks in the S&P 500 from Wikipedia
- Downloads the historical data for each stock from Stooq using `pandas-datareader`
- Uses multiprocessing to speed up the download, which can still take ~2-5 minutes
- To use a different `pandas-datareader` data source, update `data = web.DataReader(ticker, 'stooq', start_date, end_date)` in `get_stock_data()`
- Saves the data to `data/<TICKER>.parquet`. Uses the `parquet` format for faster read/write times.
- Downloads the latest market capitalization data from Yahoo Finance and saves it to `sp500_market_caps.json`

### Generating the grid

To find the optimal grid positions for each stock, run the following command:

This saves the grid positions to `sp500_grid.csv`.

The script performs the following:

1. Loads the historical data for all stocks
2. Performs hierarchical clustering on the stocks based on their correlation
3. Creates a force-directed graph to find the optimal grid positions for each stock
4. Evaluates the grid positions by checking cluster tightness and how well larger market cap stocks are concentrated in the center of the grid
5. Repeats steps 3 and 4 for 10000 iterations to find the best grid placement
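
The evaluate-and-repeat loop at the end of this list can be sketched as a simple best-of-N search. Everything here is an assumption made for illustration: the scoring formula, the `placement_score` and `search_grid` names, and the use of a random assignment as a stand-in for the force-directed layout. The project's actual evaluation may weigh these terms differently.

```python
import random


def placement_score(placement, clusters, market_caps, size):
    """Lower is better: tight clusters, large caps near the grid center."""
    center = (size - 1) / 2
    # Cluster tightness: Manhattan distance between same-cluster stocks.
    tightness = 0.0
    for members in clusters:
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                (r1, c1), (r2, c2) = placement[a], placement[b]
                tightness += abs(r1 - r2) + abs(c1 - c2)
    # Centrality: market-cap-weighted distance from the grid center.
    total_cap = sum(market_caps.values())
    centrality = sum(
        market_caps[t] / total_cap * (abs(r - center) + abs(c - center))
        for t, (r, c) in placement.items()
    )
    return tightness + centrality


def search_grid(clusters, market_caps, size, iterations=10000, seed=0):
    """Generate many candidate placements and keep the best-scoring one."""
    rng = random.Random(seed)
    tickers = [t for members in clusters for t in members]
    cells = [(r, c) for r in range(size) for c in range(size)]
    best, best_score = None, float("inf")
    for _ in range(iterations):
        # Stand-in for the force-directed layout: a random assignment.
        candidate = dict(zip(tickers, rng.sample(cells, len(tickers))))
        score = placement_score(candidate, clusters, market_caps, size)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score
```

`clusters` is a list of ticker lists (one per cluster) and `market_caps` maps each ticker to its market capitalization; `size` is the side length of the square grid.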

Here is the final grid placement:

### Training the Neural Network

To train the LSTM models for grid weights, run the following command:

This will train the LSTM models for each neighboring stock pair and save the weights to `weights/<TICKER>.pth`, the model to `models/<TICKER>.pt`, and the scalers to `scalers/<TICKER>/.pkl`

**The models are trained on sequences of 5 consecutive days**.
Shorter sequences are impractical because it is hard to make predictions from a single day's data.
Longer sequences are inefficient because the number of possible sequences increases exponentially and the
prediction complexity increases with the sequence length.
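
To make the 5-day windowing concrete, here is a minimal sketch of slicing two aligned series of daily changes into training pairs. The `make_training_pairs` helper is hypothetical, and the choice of target (the neighboring stock's change on the last day of each window) is an assumption for illustration; the notebook describes what the models actually predict.

```python
def make_training_pairs(source_changes, target_changes, seq_len=5):
    """Pair each seq_len-day window of one stock with a neighbor's change.

    Assumption: the target is the neighbor's change on the window's last day.
    """
    if len(source_changes) != len(target_changes):
        raise ValueError("both series must cover the same trading days")
    pairs = []
    # Slide a seq_len-day window forward one day at a time.
    for end in range(seq_len, len(source_changes) + 1):
        window = source_changes[end - seq_len:end]
        pairs.append((window, target_changes[end - 1]))
    return pairs
```

A series of `n` trading days yields `n - 4` overlapping windows when `seq_len` is 5.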

This is a very computationally expensive process. It took over 7 hours on a 4-core laptop with 16 GB RAM and over 3 hours on a 16-core server instance. Currently, the models are trained on CPU only as there are issues training PyTorch LSTM models on GPU.

## Notes

### Using the LSTM Models without the Simulator

The LSTM models can be used without the simulator by calling the `LSTMModel` class directly.

Using the provided `load_model_and_scaler`, the models and scalers can be loaded from the saved files.

Then, we can use the `predict` method to make predictions for a given input sequence.
The input sequence should be a `numpy` array of shape `(5, 1)` because the models were trained on 5-day sequences.

Each input must be the percentage change from the previous day's closing price, expressed in percentage points rather than as a fraction. Example: pass 1.23% as 1.23, not 0.0123.
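
As a sketch of preparing such an input from raw prices, the helper below converts six consecutive closing prices into five day-over-day changes in percentage points. `closes_to_model_input` is a hypothetical name, not part of the project, and any scaling applied by the saved scalers is not shown.

```python
import numpy as np


def closes_to_model_input(closes):
    """Turn 6 consecutive closing prices into a (5, 1) array of changes."""
    closes = np.asarray(closes, dtype=float)
    if closes.shape != (6,):
        raise ValueError("need 6 closes to get 5 day-over-day changes")
    # Percentage points, not fractions: a 1.23% rise becomes 1.23.
    pct = (closes[1:] / closes[:-1] - 1.0) * 100.0
    return pct.reshape(5, 1)
```

Six prices are needed because each change is measured against the previous day's close.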

#### Using real data

Assuming the latest data has been downloaded,
we can use the `get_trailing_stock_data` from `simulator.py` to get the previous days' data.

To convert the SP500-Cellular-Automata.ipynb notebook to a PDF:

`jupyter nbconvert --to webpdf --allow-chromium-download SP500-Cellular-Automata.ipynb`