Predicting Stock Prices and Corporate Financial Ratios with Facebook Prophet Using Python

I started off wanting to analyze the sustainability of mortgage buydowns by home builders. But as it happens, my ADHD had other plans. Three weeks later, here we are with code that looks for trends and anomalies in corporate financials. The code generates detailed HTML profiles for in-depth financial analysis of stocks.

In a series of previous posts, I explained how to extract corporate financial data and stock price data. Now, let’s go a step further and use Plotly and Facebook’s Prophet library to forecast these financial metrics and stock prices. Note that the goal here isn’t necessarily to predict future values, but to model the data in such a way that we can quickly spot any anomalies. In this tutorial, we’ll read data from the HDF5 files created in those previous posts, so make sure to check them out first.
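
As a rough illustration of what spotting anomalies can mean here, the sketch below flags any observation that falls outside a forecast's uncertainty band. The `yhat_lower` and `yhat_upper` names match the columns Prophet returns from `predict`, but the `flag_anomalies` helper and the sample numbers are hypothetical and are not part of the code in this post.

```python
def flag_anomalies(actuals, forecast_rows):
    """Return the dates whose actual value lies outside the forecast band.

    actuals: dict mapping a date string to the observed value.
    forecast_rows: list of dicts with 'ds', 'yhat_lower' and 'yhat_upper' keys,
    mirroring the columns of a Prophet forecast (plain dicts keep the sketch
    free of dependencies).
    """
    anomalies = []
    for row in forecast_rows:
        actual = actuals.get(row["ds"])
        if actual is None:
            continue  # no observation for this date, e.g. a future period
        if actual < row["yhat_lower"] or actual > row["yhat_upper"]:
            anomalies.append(row["ds"])
    return anomalies


# Made-up quarterly revenue figures, in billions
actuals = {"2023-03-31": 69.8, "2023-06-30": 74.6, "2023-09-30": 58.0}
forecast_rows = [
    {"ds": "2023-03-31", "yhat_lower": 66.0, "yhat_upper": 73.0},
    {"ds": "2023-06-30", "yhat_lower": 70.0, "yhat_upper": 78.0},
    {"ds": "2023-09-30", "yhat_lower": 72.0, "yhat_upper": 80.0},
    {"ds": "2023-12-31", "yhat_lower": 75.0, "yhat_upper": 84.0},
]
print(flag_anomalies(actuals, forecast_rows))
```

Only the third quarter is flagged, because its actual value sits below the lower bound. The last forecast row is skipped since no actual value exists for it yet.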

Generated HTML Output: A Preview of What the Code Produces

The code generates an in-depth profile of any stock specified. In this example, we look at Google (Alphabet Inc.). The output is a structured HTML file with several sections. First, it provides basic details such as the company’s name, ticker symbol, and the date of the most recent update. Next, it lists contact information, including the company’s web URL, phone number, and physical address. The sector and industry in which the company operates, as well as the number of full-time employees, are also displayed. The output then covers valuation metrics such as P/E ratios, enterprise value, and other financial ratios. Finally, it highlights key financial data points, such as market capitalization, EBITDA, and various earnings and revenue estimates. Together, these sections give a detailed picture of Alphabet Inc.’s business and financial standing.

Forecasting Future Adjusted Close Prices with Facebook Prophet: A One-Year Outlook

This plot shows the historical and predicted adjusted close prices for a stock, with the forecast produced by Facebook Prophet. The x-axis is the timeline, extending from the earliest available historical data to one year into the future, and the y-axis shows the adjusted close price. Historical prices are plotted as a purple line with markers. The predicted value is a solid green line, with blue dashed lines marking the upper and lower bounds of the forecast. Prophet models the trend and yearly seasonality in the historical data to produce this one-year outlook on potential price movements.
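
A minimal sketch of the price forecast, assuming a DataFrame indexed by `Date` with an `Adjusted_close` column, as in the HDF5 price data used in this post. The helper names `to_prophet_frame` and `forecast_one_year` and the sample prices are hypothetical. The Prophet settings mirror the ones in the full code, without the comparison of seasonality modes.

```python
import pandas as pd


def to_prophet_frame(prices: pd.DataFrame) -> pd.DataFrame:
    """Rename a Date-indexed price frame to the ds/y columns Prophet expects."""
    return (
        prices.reset_index()[["Date", "Adjusted_close"]]
        .rename(columns={"Date": "ds", "Adjusted_close": "y"})
    )


def forecast_one_year(prices: pd.DataFrame) -> pd.DataFrame:
    """Fit Prophet on daily prices and forecast 365 days ahead."""
    # Imported lazily so to_prophet_frame can be used without Prophet installed
    from prophet import Prophet

    model = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False,
    )
    model.fit(to_prophet_frame(prices))
    future = model.make_future_dataframe(periods=365)  # daily steps by default
    forecast = model.predict(future)
    # yhat is the prediction; yhat_lower and yhat_upper bound the uncertainty band
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Tiny made-up price series, only to show the expected input shape
sample = pd.DataFrame(
    {"Adjusted_close": [100.0, 101.5, 99.8]},
    index=pd.DatetimeIndex(["2023-01-03", "2023-01-04", "2023-01-05"], name="Date"),
)
print(to_prophet_frame(sample))
```

Calling `forecast_one_year` on a real price history returns one row per historical date plus 365 future dates, which is what the green and blue lines in the plot are drawn from.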

Forecasting Future Financial Ratios with Facebook Prophet: A One-Year Outlook

Financial ratios are key indicators of a company’s financial health and performance. They offer crucial insights into various aspects like profitability, liquidity, and valuation. In this section, we’ll leverage the predictive power of Facebook Prophet to forecast these ratios over the next year.
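
For readers less familiar with these ratios, here is a small worked example. The upstream definitions are not shown in this post, so the formulas below are the standard textbook ones, the `ratio_as_percent` helper is hypothetical, and all numbers are made up. Like the plotting code, it expresses ROE, earnings yield, and dividend yield as percentages.

```python
def ratio_as_percent(numerator, denominator):
    """Return numerator / denominator as a percentage, or None if undefined."""
    if denominator == 0:
        return None
    return numerator / denominator * 100


# Made-up inputs
net_income = 15.0           # billions
stockholder_equity = 120.0  # billions
earnings_per_share = 5.0
dividend_per_share = 1.0
share_price = 100.0

roe = ratio_as_percent(net_income, stockholder_equity)
earnings_yield = ratio_as_percent(earnings_per_share, share_price)
dividend_yield = ratio_as_percent(dividend_per_share, share_price)

print(f"ROE: {roe:.2f}%")
print(f"Earnings yield: {earnings_yield:.2f}%")
print(f"Dividend yield: {dividend_yield:.2f}%")
```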

Forecasting Future Earnings and Profitability Metrics from the Income Statement with Facebook Prophet: A One-Year Outlook

These plots present a comprehensive visualization of key earnings and profitability metrics extracted from the income statement, including Net Income, Operating Income, Total Revenue, Gross Profit, and Free Cash Flow. The financial numbers are then forward projected using Facebook Prophet. These predictions not only display the expected trends for the next year but also encompass upper and lower bounds, providing a range of possible outcomes. This integration of historical data with advanced forecasting techniques offers a nuanced understanding of the company’s financial trajectory.

Forecasting Future Balance Sheet Metrics with Facebook Prophet: A One-Year Outlook

These plots offer an in-depth analysis of crucial balance sheet metrics such as Total Assets, Total Stockholder Equity, Retained Earnings, Long-Term Debt, and Total Liabilities. Leveraging the predictive power of Facebook Prophet, these key financial indicators are forecasted into the future. The projections not only illustrate the expected financial posture for the forthcoming year but also include upper and lower prediction intervals, giving a full spectrum of potential financial scenarios. By melding past balance sheet data with sophisticated predictive modeling, the plots provide a multi-dimensional view of the company’s expected financial stability and risk profile.

Forecasting Future Cash Flow Metrics with Facebook Prophet: A One-Year Outlook

These plots dive into essential cash flow metrics, specifically Total Cash from Operating Activities, Capital Expenditures, and Dividends Paid. These key figures are also extended into the future using Facebook Prophet’s forecasting capabilities. The resulting predictions not only map out the anticipated cash flow movements for the next year but are also bracketed by upper and lower uncertainty intervals, presenting a range of financial possibilities. These plots offer a well-rounded perspective on the company’s future liquidity and capital allocation strategies.

Other Financial Metrics

In addition to the key financial metrics, the analysis delves into a diverse set of other metrics, plotting each meticulously. These metrics are categorized as follows:

Efficiency and Activity Metrics from Income Statement:

  • Research Development
  • Selling General Administrative


Stockholder’s Equity and Capital Structure:

  • Common Stock Shares Outstanding


Additional Important Metrics:

  • Net Debt
  • Net Receivables
  • Inventory
  • Accounts Payable
  • Total Current Assets
  • Total Current Liabilities

Less Critical Metrics:

  • Income Before Tax
  • Cost of Revenue
  • Intangible Assets
  • Earning Assets
  • Other Current Assets
  • Deferred Long-Term Liabilities
  • Other Current Liabilities
  • Common Stock
  • Capital Stock
  • Other Liabilities
  • Goodwill
  • Other Assets
  • Cash
  • Cash and Equivalents
  • Current Deferred Revenue
  • Short-Term Debt
  • Short/Long-Term Debt
  • Short/Long-Term Debt Total
  • Other Stockholder Equity
  • Property Plant Equipment
  • Long-Term Investments
  • Net Tangible Assets
  • Short-Term Investments

Other Metrics:

  • Effect of Accounting Charges
  • Income Tax Expense
  • Non-Operating Income Net Other
  • Selling and Marketing Expenses
  • Common Stock Total Equity
  • Preferred Stock Total Equity
  • Retained Earnings Total Equity
  • Treasury Stock
  • Accumulated Amortization
  • Non-Current Assets Other
  • Deferred Long-Term Asset Charges
  • Non-Current Assets Total
  • Capital Lease Obligations
  • Long-Term Debt Total
  • Non-Current Liabilities Other
  • Non-Current Liabilities Total
  • Negative Goodwill
  • Warrants
  • Preferred Stock Redeemable
  • Capital Surplus
  • Liabilities And Stockholders Equity
  • Cash And Short-Term Investments
  • Property Plant And Equipment Gross
  • Property Plant And Equipment Net
  • Accumulated Depreciation
  • Total Cash Flows From Investing Activities
  • Total Cash From Financing Activities
  • Net Borrowings
  • Issuance Of Capital Stock
  • Investments
  • Change To Liabilities
  • Change To Operating Activities
  • Change In Cash
  • Begin Period Cash Flow
  • End Period Cash Flow
  • Depreciation
  • Other Cash Flows From Investing Activities
  • Change To Inventory
  • Change To Account Receivables
  • Sale Purchase Of Stock
  • Other Cash Flows From Financing Activities
  • Change To Net Income
  • Change Receivables
  • Cash Flows Other Operating
  • Exchange Rate Changes
  • Cash And Cash Equivalents Changes
  • Change In Working Capital
  • Stock Based Compensation
  • Other Non-Cash Items

Code Overview

Prerequisites

  • Python 3.x
  • pandas
  • Plotly
  • Prophet
  • scikit-learn
  • bs4 (BeautifulSoup)
  • PyTables (the tables package, which pandas needs to read HDF5 files)
  • Standard-library modules: logging, pathlib, os, collections, time, math

Helper Functions

The code includes several helper functions, such as:

  • calculate_mae: Calculates the Mean Absolute Error between the actual and predicted data.
  • access_hdf5_with_retries: Attempts to read an HDF5 file, retrying up to a specified number of times.
  • read_general_info: Reads general company information from an HDF5 file.
  • infer_dtype: Infers the data type of a Pandas Series.
  • fetch_data_for_symbol_from_multiple_h5: Fetches data for a specific symbol from multiple HDF5 files.
  • get_company_name: Gets the company name for a given symbol from an HDF5 file.
  • add_content_before_plot: Adds additional content before the plot in an HTML file.
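
Several of these helpers share the same retry loop around an HDF5 read. The sketch below factors that pattern into one function. The name `call_with_retries` and the `flaky_read` stand-in are hypothetical, and catching `OSError` is an assumption made for the example; the code in this post catches the broader `Exception`.

```python
import logging
import time


def call_with_retries(action, max_retries=3, sleep_duration=5):
    """Call action() until it succeeds or max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except OSError as exc:  # assumption: a busy or locked file raises OSError
            if attempt == max_retries:
                logging.error("Max retries reached: %s", exc)
                raise
            logging.info("Attempt %d failed (%s); retrying in %s seconds",
                         attempt, exc, sleep_duration)
            time.sleep(sleep_duration)


# Hypothetical reader that fails twice before succeeding
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("file is locked")
    return ["/GOOG", "/AAPL"]


print(call_with_retries(flaky_read, sleep_duration=0))
```

Passing the read as a callable keeps the retry policy in one place, so each helper only has to describe what it reads.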

Forecasting Function

The core of this code is the forecast_with_multiple_metrics function, which:

  1. Accepts a DataFrame of financial metrics.
  2. Performs time-series forecasting on each metric using Facebook’s Prophet.
  3. Plots the actual and forecasted metrics using Plotly.

def forecast_with_multiple_metrics(df: pd.DataFrame, periods: int = 4, save_dir: str = "plots/", use_all_data: bool = True, eod_price_data: str = 'eod_price_data.h5'):
    ...
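
One step inside the forecasting code is choosing between Prophet's additive and multiplicative seasonality modes by comparing the mean absolute error of each fit against the historical values. The sketch below shows only that selection step on plain lists. The `mae` and `pick_best_fit` helpers and the numbers are hypothetical, and the fitted values stand in for the `yhat` column of each model's output.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pick_best_fit(actual, fitted_by_mode):
    """Return (mode, error) for the candidate with the lowest in-sample MAE."""
    scores = {mode: mae(actual, fitted) for mode, fitted in fitted_by_mode.items()}
    best_mode = min(scores, key=scores.get)
    return best_mode, scores[best_mode]


actual = [10.0, 12.0, 15.0, 19.0]
fitted_by_mode = {
    "additive": [11.0, 12.0, 14.0, 17.0],        # MAE = (1 + 0 + 1 + 2) / 4 = 1.0
    "multiplicative": [10.0, 12.5, 15.0, 18.5],  # MAE = (0 + 0.5 + 0 + 0.5) / 4 = 0.25
}
print(pick_best_fit(actual, fitted_by_mode))
```

Because the error is measured on the same data the model was fitted to, it reflects how closely each mode tracks history rather than how well it will forecast. That suits the anomaly-spotting goal, but a holdout period would be needed to judge predictive accuracy.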

Code

import pandas as pd
from pathlib import Path
import logging
import os
from plotly.subplots import make_subplots
from prophet import Prophet
import plotly.graph_objects as go
from collections import Counter
from pandas.api.types import is_numeric_dtype
from bs4 import BeautifulSoup
import time
from sklearn.metrics import mean_absolute_error
from math import log10


from dictionaries_and_lists import metric_definitions, reordered_columns, homebuilders, sp500, companies_with_treasuries, largest_banks, percentage_metrics

data_dir = '/home/shared/algos/data/'
plots_dir = '/home/shared/algos/eodhd_data/plots/'

# basicConfig only takes effect on its first call, so set the level once
logging.basicConfig(level=logging.DEBUG)

# Configure Pandas to display all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

symbol_exchange_map = {}


def wrap_text(text, max_length):
    """
    Wraps text to a new line at the nearest whitespace of the max_length.
    """
    wrapped_lines = []
    while len(text) > max_length:
        # Find nearest whitespace of the max_length
        split_index = text.rfind(' ', 0, max_length + 1)
        if split_index == -1:  # No whitespace found, force split
            split_index = max_length
        wrapped_lines.append(text[:split_index])
        text = text[split_index:].lstrip()
    wrapped_lines.append(text)
    return '<br>'.join(wrapped_lines)

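def format_large_number(num):
    """Abbreviate a number with a K/M/B/T suffix, e.g. 1_500_000 -> '1.50M'."""
    if abs(num) < 1_000:
        return str(num)
    # Use abs() so negative values work, and cap the magnitude so values of a
    # quadrillion or more still use the 'T' suffix instead of raising IndexError
    magnitude = min(int(log10(abs(num)) // 3), 4)
    value = num / (10 ** (3 * magnitude))
    return f"{value:.2f}{' KMBT'[magnitude]}"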

def format_percentage(data):
    try:
        # Remove any non-numeric characters like commas and percentage signs
        if isinstance(data, str):
            data = data.replace(',', '').replace('%', '').strip()
        # Convert to float and format as a percentage
        return "{:.2%}".format(float(data))
    except (ValueError, TypeError):
        # TypeError covers None and other values float() cannot handle
        print(f"Could not convert {data} to a percentage.")
        return data  # Return the original data if it cannot be converted


def calculate_mae(actual, predicted):
    return mean_absolute_error(actual, predicted)

def access_hdf5_with_retries(hdf5_file_path, mode, max_retries=3, sleep_duration=5):
    retries = 0
    while retries < max_retries:
        try:
            with pd.HDFStore(hdf5_file_path, mode) as store:
                return store.keys()  # You can customize this part to return what you need
        except Exception as e:  # Replace Exception with a more specific exception if possible
            if retries < max_retries - 1:
                logging.info(f"An exception occurred while reading {hdf5_file_path}: {e}")
                logging.info(f"Retrying in {sleep_duration} seconds...")
                time.sleep(sleep_duration)
                retries += 1
            else:
                logging.error(f"Max retries reached. Could not read {hdf5_file_path}. Exiting...")
                raise

def read_general_info(symbol, h5_general_path, max_retries=3, sleep_duration=5):
    retries = 0
    while retries < max_retries:
        try:
            with pd.HDFStore(h5_general_path, 'r') as store:
                key = f'/{symbol}'
                if key in store.keys():
                    general_info = store.get(key)
                    info_dict = dict(zip(general_info['SubCategory'], general_info['Data']))
                    sector = info_dict.get('Sector', 'N/A')
                    industry = info_dict.get('Industry', 'N/A')
                    description = info_dict.get('Description', 'N/A')
                    full_time_employees = info_dict.get('FullTimeEmployees', 'N/A')
                    updated_at = info_dict.get('UpdatedAt', 'N/A')
                    web_url = info_dict.get('WebURL', 'N/A')
                    phone = info_dict.get('Phone', 'N/A')
                    address = info_dict.get('Address', 'N/A')
                    name = info_dict.get('Name', 'N/A')
                    exchange = info_dict.get('Exchange', 'N/A')
                    return sector, industry, description, full_time_employees, updated_at, web_url, phone, address, name, exchange
                else:
                    return ['N/A']*10
        except Exception as e:
            if retries < max_retries - 1:
                print(f"An exception occurred while reading {h5_general_path}: {e}")
                print(f"Retrying in {sleep_duration} seconds...")
                time.sleep(sleep_duration)
                retries += 1
            else:
                print(f"Max retries reached. Could not read {h5_general_path}. Exiting...")
                raise



def infer_dtype(series):
    sample = series.dropna().head(100)  # Sample the first 100 non-null rows

    # Debugging line
    print(f"Sample for {series.name}: {sample}")

    if sample.empty:
        return None

    if all(sample.apply(lambda x: isinstance(x, str))):
        return 'object'

    if is_numeric_dtype(sample) and all(sample.apply(lambda x: x == int(x))):
        return 'int64'

    if is_numeric_dtype(sample):
        return 'float64'
    return None


def fetch_data_for_symbol_from_multiple_h5(h5_filepaths, symbol, max_retries=3, sleep_duration=5):
    combined_data = None
    common_keys = ['Symbol', 'date', 'filing_date', 'currency_symbol']

    for h5_filepath in h5_filepaths:
        h5_path = Path(h5_filepath)
        if not h5_path.exists():
            logging.info(f"The file {h5_filepath} does not exist.")
            continue

        retries = 0
        while retries < max_retries:
            try:
                with pd.HDFStore(h5_filepath, 'r') as store:
                    if f"/{symbol}" in store.keys():
                        logging.info(f"Symbol {symbol} found in {h5_filepath}.")
                        symbol_data = store.select(symbol)

                        if 'date' not in symbol_data.columns and 'Date' not in symbol_data.columns:
                            if isinstance(symbol_data.index, pd.DatetimeIndex):
                                logging.info(
                                    f"'date' and 'Date' columns not found, but datetime index exists in {h5_filepath} for symbol {symbol}.")
                                # Move the datetime index into a 'date' column exactly once,
                                # whatever the index was originally named
                                symbol_data = symbol_data.rename_axis('date').reset_index()
                            else:
                                logging.warning(
                                    f"'date' and 'Date' columns and datetime index not found in {h5_filepath} for symbol {symbol}.")
                                break
                        else:
                            # Ensure the date column is standardized to 'date'
                            if 'Date' in symbol_data.columns:
                                symbol_data.rename(columns={'Date': 'date'}, inplace=True)


                        # Rename 'netIncome' if it's present in the dataframe
                        if 'netIncome' in symbol_data.columns:
                            # Extract the type of financial statement from the filename
                            parts = h5_path.stem.split('_')
                            statement_type = parts[1]
                            metric_name = parts[2]

                            statement_readable = {
                                "Cash": "Cash Flow from Operating Activities",
                                "Income": "netIncome",
                            }.get(statement_type, statement_type)

                            metric_title = f"{metric_name} ({statement_readable})"

                            # Ensure statement_type is one of the reports where 'netIncome' appears before renaming
                            if statement_type in ["Income", "Cash"]:
                                symbol_data.rename(columns={'netIncome': metric_title}, inplace=True)

                        if combined_data is None:
                            combined_data = symbol_data
                        else:
                            # Dynamically adjust common keys based on available columns
                            keys_for_merge = [key for key in common_keys if
                                              key in symbol_data.columns and key in combined_data.columns]
                            combined_data = pd.merge(combined_data, symbol_data, on=keys_for_merge, how='outer')
                        break

                    else:
                        logging.info(f"Symbol {symbol} not found in {h5_filepath}.")
                        break

            except Exception as e:
                if retries < max_retries - 1:
                    logging.info(f"An exception occurred while reading {h5_filepath}: {e}")
                    logging.info(f"Retrying in {sleep_duration} seconds...")
                    time.sleep(sleep_duration)
                    retries += 1
                else:
                    logging.info(f"Max retries reached. Could not read {h5_filepath}. Exiting...")
                    raise

    if combined_data is None:
        logging.info(f"Symbol {symbol} not found in any of the HDF5 files.")
        return None
    else:
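        for col in combined_data.select_dtypes(include=['object']).columns:
            # errors='ignore' is deprecated in recent pandas; catch the failure instead
            try:
                combined_data[col] = pd.to_numeric(combined_data[col])
            except (ValueError, TypeError):
                pass  # leave non-numeric columns unchanged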

        if 'date' in combined_data.columns:
            combined_data.sort_values(by='date', inplace=True)
        else:
            logging.info(f"'date' column not found in combined data for symbol {symbol}. Sorting by index instead.")
            combined_data.sort_index(inplace=True)
        logging.info(f"{symbol}: successfully combined all HDF5 files.")
        return combined_data



def get_company_name(symbol, h5_file_path, country_code='US'):
    try:
        # Read the DataFrame for the entire country from the HDF5 file
        symbols_df = pd.read_hdf(h5_file_path, key=f'/{country_code}')

        # Filter by the specific symbol to get the company name
        company_name_row = symbols_df[symbols_df['Code'] == symbol]
        if not company_name_row.empty:
            return company_name_row['Name'].iloc[0]
        else:
            logging.error(f"No data for symbol {symbol} in the DataFrame.")
            return "Unknown"
    except KeyError:
        logging.error(f"No object named {country_code} in the file {h5_file_path}")
        return "Unknown"


def add_content_before_plot(symbol, max_retries=3, sleep_duration=5):
    target_filename = f"{symbol}_quarterly.html"
    html_file_path = os.path.join(plots_dir, target_filename)

    if not os.path.exists(html_file_path):
        print(f"HTML file for symbol {symbol} not found.")
        return

    # Fetch additional info from General.h5
    exchange = symbol_exchange_map.get(symbol, 'Other')  # Default to 'Other' if exchange is not found
    h5_general_path = os.path.join(data_dir, f"{exchange}_General.h5")

    sector, industry, description, full_time_employees, updated_at, web_url, phone, address, name, exchange = read_general_info(symbol, h5_general_path, max_retries, sleep_duration)

    # Prepare new HTML content
    new_html_content = f"""
    <h1>{symbol} - {name} - {exchange}</h1>
    <p>Updated At: {updated_at}</p>
    <p><strong style='font-size: larger;'>WebURL:</strong> <a href='{web_url}' target='_blank'>{web_url}</a>
    <strong style='font-size: larger;'>Phone:</strong> {phone}
    <strong style='font-size: larger;'>Address:</strong> {address}<br>
    <strong style='font-size: larger;'>Sector:</strong> {sector}
    <strong style='font-size: larger;'>Industry:</strong> {industry}<br>
    <strong style='font-size: larger;'>Full Time Employees:</strong> {full_time_employees}</p><br>
    <strong style='font-size: larger;'>Description:</strong> {description}<br><br>
    """

    # Read valuation.h5 data for the symbol
    h5_valuation_path = os.path.join(data_dir, f"{exchange}_Valuation.h5")

    retries = 0
    valuation_data = None
    while retries < max_retries:
        try:
            with pd.HDFStore(h5_valuation_path, 'r') as store:
                if symbol in store:
                    valuation_data = store.get(symbol)
                else:
                    valuation_data = pd.DataFrame(columns=['SubCategory', 'Data'])
                    print(f"The symbol {symbol} valuation data does not exist in {h5_valuation_path}")
            break
        except Exception as e:
            if retries < max_retries - 1:
                print(f"An exception occurred while reading {h5_valuation_path}: {e}")
                print(f"Retrying in {sleep_duration} seconds...")
                time.sleep(sleep_duration)
                retries += 1
            else:
                print(f"Max retries reached. Could not read {h5_valuation_path}. Exiting...")
                raise

    # Add valuation_data to new_html_content
    valuation_html = "<h2>Valuation</h2><div style='display: flex; flex-wrap: wrap;'>"
    if valuation_data is not None:
        print(f"valuation_data: {valuation_data}")
        valuation_data['Data'] = pd.to_numeric(valuation_data['Data'], errors='coerce')

        for index, row in valuation_data.iterrows():
            data = row['Data']
            if pd.notnull(data):  # Check if 'Data' is not NaN
                if row['SubCategory'] in percentage_metrics:
                    formatted_data = "{:.2%}".format(float(data))
                elif isinstance(data, (int, float)) and abs(data) >= 1_000:  # Large numbers
                    formatted_data = format_large_number(data)
                elif isinstance(data, (int, float)):
                    print('formatting numbers with commas in Valuation')
                    print(f"data: {data}, type: {type(data)}")
                    formatted_data = f"{data:,.2f}"  # For other numbers, just format with commas
                else:
                    formatted_data = data  # For non-numeric data, leave as is
            else:
                # If data is NaN or non-numeric, leave as is
                formatted_data = row['Data']

            valuation_html += f"<div style='flex: 0 0 calc(33.333% - 10px); margin-right: 10px; margin-bottom: 10px;'>"
            valuation_html += f"<strong>{row['SubCategory']}</strong>: {formatted_data}</div>"
    valuation_html += "</div>"

    new_html_content += valuation_html  # Append valuation data to new_html_content



    # Read highlights.h5 data for the symbol
    h5_highlights_path = os.path.join(data_dir, f"{exchange}_Highlights.h5")
    retries = 0
    while retries < max_retries:
        try:
            with pd.HDFStore(h5_highlights_path, 'r') as store:
                if symbol in store:
                    highlights_data = store.get(symbol)
                else:
                    highlights_data = pd.DataFrame(columns=['SubCategory', 'Data'])
                    print(f"The symbol {symbol} highlights data does not exist in {h5_highlights_path}")
            break
        except Exception as e:
            if retries < max_retries - 1:
                print(f"An exception occurred while reading {h5_highlights_path}: {e}")
                print(f"Retrying in {sleep_duration} seconds...")
                time.sleep(sleep_duration)
                retries += 1
            else:
                print(f"Max retries reached. Could not read {h5_highlights_path}. Exiting...")
                raise

    # Add title before highlights
    new_html_content += "<h2>Highlights</h2>"
    # Add highlights_data to new_html_content
    highlights_html = "<div style='display: flex; flex-wrap: wrap;'>"

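    def _to_number_if_possible(value):
        # errors='ignore' is deprecated in recent pandas; fall back to the raw value
        try:
            return pd.to_numeric(value)
        except (ValueError, TypeError):
            return value

    highlights_data['Data'] = highlights_data['Data'].apply(_to_number_if_possible)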

    for index, row in highlights_data.iterrows():
        subcategory = row['SubCategory']
        data = row['Data']

        if subcategory in percentage_metrics:
            formatted_data = format_percentage(data)
        elif pd.notnull(data) and isinstance(data, (int, float)):
            if abs(data) >= 1_000:  # Large numbers
                print(f"{symbol}: {subcategory} is a large number. Formatting {data} with commas...")
                formatted_data = format_large_number(data)
            else:  # Other numeric data that is not a large number
                print(f"{symbol}: {subcategory} is a numeric value. Formatting {data} with commas...")
                formatted_data = f"{data:,.2f}"
        else:
            print(f'{symbol}: {subcategory} is a non-numeric value. Data: {data}')
            formatted_data = data  # For non-numeric data, leave as is

        highlights_html += f"""<div style='flex: 0 0 calc(33.333% - 10px); margin-right: 10px; margin-bottom: 10px;'>
                                <strong>{subcategory}</strong>: {formatted_data}
                               </div>"""
    highlights_html += "</div>"
    new_html_content += highlights_html  # Append highlights data to new_html_content

    # Add the disclaimer to the end of new_html_content
    disclaimer = "<br><br><p><strong>Disclaimer:</strong> The last data point has been excluded from the Prophet prediction.</p>"
    new_html_content += disclaimer


    with open(html_file_path, 'r', encoding='utf-8') as f:
        html_content = f.read()

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find the div that contains the Plotly plot
    plot_div = soup.find('div', {'class': 'plotly-graph-div'})

    if plot_div:
        # Create a BeautifulSoup object from the new_content string
        new_content_soup = BeautifulSoup(new_html_content, 'html.parser')

        # Insert the new content before the Plotly plot
        plot_div.insert_before(new_content_soup)

        # Save the modified HTML back to disk
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(str(soup))
    else:
        print(f"Plotly plot not found in the HTML file {html_file_path}.")


def forecast_with_multiple_metrics(df: pd.DataFrame, periods: int = 4, save_dir: str = "plots/", use_all_data: bool = True, eod_price_data: str = 'eod_price_data.h5'):
    def plot_adjusted_close_from_h5(fig, symbol, hdf5_file_path, max_retries=3, sleep_duration=5):
        hdf5_file_path = data_dir + hdf5_file_path
        keys = access_hdf5_with_retries(hdf5_file_path, 'r', max_retries, sleep_duration)

        if f'/{symbol}' in keys:
            with pd.HDFStore(hdf5_file_path, 'r') as store:
                stock_data = store.get(symbol)

                # Add actual adjusted_close to the plot
                fig.add_trace(
                    go.Scatter(x=stock_data.index, y=stock_data['Adjusted_close'],
                               mode='lines+markers',
                               name='Actual Adjusted_close',
                               legendgroup='Actual',
                               line=dict(color='purple'),
                               showlegend=True),
                    row=1, col=1
                )

                prophet_df = stock_data.reset_index()[['Date', 'Adjusted_close']].rename(
                    columns={'Date': 'ds', 'Adjusted_close': 'y'})

                best_mae = float('inf')
                best_forecast = None
                best_mode = None

                for mode in ['additive', 'multiplicative']:
                    model = Prophet(
                        seasonality_mode=mode,
                        yearly_seasonality=True,
                        weekly_seasonality=False,
                        daily_seasonality=False)
                    model.fit(prophet_df)
                    future = model.make_future_dataframe(periods=365)  # 1-year prediction
                    forecast = model.predict(future)

                    mae = calculate_mae(prophet_df['y'], forecast.loc[:len(prophet_df) - 1, 'yhat'])

                    if mae < best_mae:
                        best_mae = mae
                        best_forecast = forecast
                        best_mode = mode

                # Add the best forecast to the plot
                fig.add_trace(
                    go.Scatter(x=best_forecast['ds'], y=best_forecast['yhat'],
                               mode='lines',
                               name='Adjusted_close Forecast',
                               legendgroup='Forecast',
                               line=dict(color='green'),
                               showlegend=True),
                    row=1, col=1
                )

                # Add yhat_upper and yhat_lower to the plot
                fig.add_trace(
                    go.Scatter(x=best_forecast['ds'], y=best_forecast['yhat_upper'],
                               mode='lines',
                               name='Upper Forecast',
                               legendgroup='Upper Forecast',
                               line=dict(color='blue', dash='dash'),
                               showlegend=True),
                    row=1, col=1
                )

                fig.add_trace(
                    go.Scatter(x=best_forecast['ds'], y=best_forecast['yhat_lower'],
                               mode='lines',
                               name='Lower Forecast',
                               legendgroup='Lower Forecast',
                               line=dict(color='blue', dash='dash'),
                               showlegend=True),
                    row=1, col=1
                )

        else:
            logging.warning(f"No Adjusted_close data found for symbol: {symbol}")

    pd.set_option('display.max_columns', None)
    logging.debug(df.columns.tolist())

    # Get the columns that are both in df.columns and reordered_columns
    common_columns = [col for col in reordered_columns if col in df.columns]

    # Check if common_columns is not empty
    if common_columns:
        # Reorder the DataFrame using the common columns
        df = df[common_columns + [col for col in df.columns if col not in common_columns]]
    else:
        logging.warning("No common columns between df and reordered_columns.")

    metrics = [col for col in df.columns if df[col].dtype in ['int64', 'float64']]

    logging.info(metrics)

    subplot_titles = []
    for metric in metrics:
        definition = metric_definitions.get(metric, 'No definition available')
        wrapped_definition = wrap_text(definition, 160)  # Adjust the max_length as needed
        title = f"<b>{metric}</b><br><span style='font-size: smaller;'>{wrapped_definition}</span>"
        subplot_titles.append(title)

    fig = make_subplots(rows=len(subplot_titles) + 1, cols=1, subplot_titles=['Adjusted_close'] + subplot_titles)

    symbol = df['Symbol'].iloc[0]

    plot_adjusted_close_from_h5(fig, symbol, eod_price_data)




    # Fetch the company_name
    h5_symbols = os.path.join(data_dir, "symbols.h5")
    company_name = get_company_name(symbol, h5_symbols)

    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    if not use_all_data:
        # Plotly annotation text uses <br> for line breaks; "\n" is not rendered as one
        disclaimer = "Disclaimer: The last data point has been excluded from the Prophet prediction.<br>"
        disclaimer += "There may not be enough data points for Prophet to make accurate predictions."
        fig.add_annotation(
            dict(
                x=0,
                y=1.1,
                xref="paper",
                yref="paper",
                text=disclaimer,
                showarrow=False,
                font=dict(size=16)
            )
        )


    for i, metric in enumerate(metrics):
        row = i + 2

        plotting_df = df[['date', metric]].copy().dropna()
        prophet_df = plotting_df.copy()

        # Convert the specified metrics to percentages by multiplying by 100
        if metric in ['ROE', 'Earnings_Yield', 'Dividend_Yield']:
            prophet_df[metric] = prophet_df[metric] * 100  # Convert to percentage
            plotting_df[metric] = plotting_df[metric] * 100  # Convert to percentage


        if not use_all_data:
            prophet_df = prophet_df.iloc[:-1, :]

        prophet_df.rename(columns={'date': 'ds', metric: 'y'}, inplace=True)

        if prophet_df.shape[0] < 2:
            logging.warning(f"Skipping {metric} because it has fewer than 2 non-NaN rows.")
            continue

        best_mae = float('inf')
        best_forecast = None
        best_mode = None

        for mode in ['additive', 'multiplicative']:
            model = Prophet(
                seasonality_mode=mode,
                yearly_seasonality=True,
                weekly_seasonality=False,
                daily_seasonality=False)
            model.fit(prophet_df)
            future = model.make_future_dataframe(periods=periods, freq='Q')
            forecast = model.predict(future)

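            # In-sample MAE: the first len(prophet_df) forecast rows cover the training
            # history (.loc slicing is end-inclusive, hence the "- 1")
            mae = calculate_mae(prophet_df['y'], forecast.loc[:len(prophet_df) - 1, 'yhat'])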

            if mae < best_mae:
                best_mae = mae
                best_forecast = forecast
                best_mode = mode

        logging.info(f"Best seasonality mode for {metric} is {best_mode} with MAE {best_mae}")


        hover_format = '%{x}: %{y:.2f}%' if metric in ['ROE', 'Earnings_Yield', 'Dividend_Yield'] else '%{x}: %{y:,}'

        fig.add_trace(
            go.Scatter(x=plotting_df['date'], y=plotting_df[metric],
                       mode='lines+markers',
                       name=f'Actual {metric}',
                       legendgroup='Actual',
                       line=dict(color='purple'),
                       showlegend=(i == 0),
                       hovertemplate=hover_format),
            row=row, col=1
        )

        # Plot the forecasted metric data
        fig.add_trace(
            go.Scatter(x=best_forecast['ds'],
                       y=best_forecast['yhat'],
                       mode='lines+markers',
                       name=f'Forecasted {metric}',
                       legendgroup='Forecast',
                       line=dict(color='green'),
                       showlegend=(i == 0),
                       hovertemplate=hover_format),
            row=row, col=1
        )

        # Plot the upper forecast interval
        fig.add_trace(
            go.Scatter(x=best_forecast['ds'], y=best_forecast['yhat_upper'],
                       mode='lines',
                       name=f'Upper Bound {metric}',
                       legendgroup='Upper Forecast',
                       line=dict(color='blue', dash='dash'),
                       showlegend=(i == 0),
                       hovertemplate=hover_format),
            row=row, col=1
        )

        # Plot the lower forecast interval
        fig.add_trace(
            go.Scatter(x=best_forecast['ds'], y=best_forecast['yhat_lower'],
                       mode='lines',
                       name=f'Lower Bound {metric}',
                       legendgroup='Lower Forecast',
                       line=dict(color='blue', dash='dash'),
                       showlegend=(i == 0),
                       hovertemplate=hover_format),
            row=row, col=1
        )


    fig.update_layout(
        height=350 * (len(metrics) + 1),  # one row per metric plus the Adjusted_close row
        width=1800,
        title_font_size=16  # You can change this value as needed
    )

    plot_file_path = os.path.join(save_dir, f"{symbol}_quarterly.html")

    fig.write_html(plot_file_path)
    logging.info(f"Saved Prophet plot for {company_name} to {plot_file_path}")

    add_content_before_plot(symbol)

def get_symbols(h5_file_path, key='US'):
    """
    Open an HDF5 file and populate the global dictionary symbol_exchange_map
    where the symbol is the key and the exchange is the value.

    Parameters:
        h5_file_path (str): The path to the HDF5 file.
        key (str): The key to use when reading the HDF5 file. Default is 'US'.

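    Returns:
        list[str] | None: The symbols found under `key`, or None if the file
        is missing, the expected columns are absent, or reading fails.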
    """

    h5_file_path = Path(h5_file_path)

    # Check if the file exists
    if not h5_file_path.exists():
        logging.warning(f"The file {h5_file_path} does not exist.")
        return

    try:
        # Read the DataFrame from the HDF5 file
        df = pd.read_hdf(h5_file_path, key=key)

        # Check if 'Code' and 'Exchange' columns exist
        if 'Code' not in df.columns or 'Exchange' not in df.columns:
            logging.warning("The 'Code' or 'Exchange' column does not exist in the DataFrame.")
            return

        # Populate the global symbol_exchange_map
        global symbol_exchange_map
        symbol_exchange_map = dict(zip(df['Code'], df['Exchange']))
        return list(symbol_exchange_map.keys())
    except Exception as e:
        logging.error(f"An error occurred: {e}")
        return

symbols = get_symbols(os.path.join(data_dir, 'symbols.h5'), key='US')

# Each later assignment overrides the earlier ones; only the last list is processed.
symbols = ['UBER', 'LYFT', 'WE', 'IEP', 'AAPL'] + sp500 + largest_banks + companies_with_treasuries + homebuilders
symbols = ['GDHG', 'IEP']
symbols = ['GOOG']



h5_files = set()  # Use a set to automatically handle duplicates

for symbol in symbols:
    exchange = symbol_exchange_map.get(symbol, 'Other')  # Default to 'Other' if not found
    current_h5_files = [
        f"{data_dir}/{exchange}_Cash_Flow_quarterly.h5",
        f"{data_dir}/{exchange}_Balance_Sheet_quarterly.h5",
        f"{data_dir}/{exchange}_Income_Statement_quarterly.h5",
        f"{data_dir}/{exchange}_Ratios.h5"
    ]
    h5_files.update(current_h5_files)  # Update the set with the new file paths

    logging.info(f'Processing {symbol}')
    try:
        # Pass only the HDF5 files corresponding to the current symbol's exchange
        df = fetch_data_for_symbol_from_multiple_h5(list(current_h5_files), symbol, max_retries=3, sleep_duration=5)


        if df is None:
            logging.warning(f"No data found for symbol {symbol}. Skipping to the next symbol.")
            continue  # Skip to the next iteration of the loop

        # Explicit-argument variant:
        # forecast_with_multiple_metrics(df, periods=4, save_dir="plots/", use_all_data=False)

        logging.info(df.head(20))
        forecast_with_multiple_metrics(df)
    except Exception as e:
        logging.error(f"An unexpected error occurred while processing symbol {symbol}: {e}")
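
The loop over 'additive' and 'multiplicative' above keeps whichever seasonality mode gives the lowest in-sample MAE. The sketch below isolates that selection step without Prophet, so it can be run on its own. The names `positional_mae` and `pick_best_mode` are hypothetical and are not part of the original code; `calculate_mae` from earlier in this post may be implemented differently. The one deliberate choice here is that values are compared by position rather than by pandas index, which matters because the history frame keeps its original index after `dropna()` and `iloc[:-1]`.

```python
from typing import Dict, Iterable, Tuple


def positional_mae(actual: Iterable[float], predicted: Iterable[float]) -> float:
    """Mean absolute error computed by position, ignoring any pandas index."""
    actual_values = [float(v) for v in actual]
    predicted_values = [float(v) for v in predicted]
    if len(actual_values) != len(predicted_values):
        raise ValueError("actual and predicted must have the same length")
    if not actual_values:
        raise ValueError("at least one observation is required")
    total = sum(abs(a - p) for a, p in zip(actual_values, predicted_values))
    return total / len(actual_values)


def pick_best_mode(actual: Iterable[float],
                   fitted_by_mode: Dict[str, Iterable[float]]) -> Tuple[str, float]:
    """Return the (mode, mae) pair with the lowest in-sample MAE."""
    actual_values = list(actual)
    scores = {mode: positional_mae(actual_values, fitted)
              for mode, fitted in fitted_by_mode.items()}
    best_mode = min(scores, key=scores.get)
    return best_mode, scores[best_mode]


# Made-up numbers for illustration only; in the real loop the fitted values
# would be the first len(history) rows of each model's 'yhat' column.
history = [10.0, 12.0, 11.0, 13.0]
fitted = {
    "additive": [10.5, 11.5, 11.0, 12.0],
    "multiplicative": [10.0, 12.5, 11.0, 13.0],
}
best_mode, best_mae = pick_best_mode(history, fitted)
print(best_mode, best_mae)
```

Keep in mind that this score is measured on the same points the model was fitted to, so it says how closely each mode tracks the history, not how well it will forecast. That fits the goal stated at the top of the post, which is spotting anomalies rather than predicting future values.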
