Recently, I encountered an unexpected challenge: a water leak beneath the slab of my house. The ordeal had me up until 1 AM, rerouting the line through my attic with PEX piping. Amidst this late-night task, a thought occurred to me: could machine learning and forecasting have helped me detect this leak earlier, based on my water bill consumption?
I wrote some Python code, outlined below, that uses statsmodels and its SARIMAX model to forecast consumption.
I now wonder why municipalities aren't applying machine learning to data like this so they can notify customers of potential leaks early. I imagine this could save millions of gallons of water each year. The full code and an explanation follow.
Data Upload and Preparation:
The program starts by uploading a CSV file containing water usage data (in gallons) and the corresponding dates. For this to work, the CSV must have two columns named date and gallons. The data is then put into the format the model needs: the rows are sorted by date, gaps between readings are forward-filled on a daily grid, and the result is reduced to one value per month.
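As a rough illustration of the preparation step, the sketch below runs the same sort-and-fill logic on a tiny inline sample instead of an uploaded file. The three rows are made up, and `sample_csv` and `daily` are names chosen for this example only.

```python
import io

import pandas as pd

# Hypothetical sample data; a real file would have one row per meter reading.
sample_csv = """date,gallons
2023-03-15,4100
2023-01-15,4000
2023-02-15,4200
"""

df = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

# Sort by date and use the date as the index
df = df.sort_values("date").set_index("date")

# Expand to one row per day, then carry each reading forward to fill the gaps
daily = df.asfreq("D").ffill()

print(daily.head())
print("Rows after filling:", len(daily))
```

Note that the rows in the sample are deliberately out of order, so the sort matters: `asfreq` expects a sorted index.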
Creating a Predictive Model:
I used the SARIMAX model from the statsmodels library, a powerful tool for time series forecasting. The model considers both the seasonal nature of water usage and any underlying trends or cycles.
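To give a feel for the seasonal part, here is a minimal sketch using plain pandas rather than SARIMAX itself. In the program's `seasonal_order=(1, 1, 1, 12)`, the last value is the season length (12 months) and the second value requests one seasonal difference, meaning each month is compared with the same month a year earlier. The usage numbers and the 4,000-gallon "leak" below are invented for illustration.

```python
import pandas as pd

# Invented monthly usage with a summer peak, repeated for two years
yearly_pattern = [3000, 3000, 3200, 3500, 4500, 6000,
                  7000, 6800, 5000, 3800, 3200, 3000]
usage = pd.Series(
    yearly_pattern * 2,
    index=pd.period_range("2021-01", periods=24, freq="M"),
    dtype=float,
)

# Seasonal difference: this month minus the same month last year
normal_diff = usage.diff(12).dropna()

# Simulate a leak in the final month and difference again
usage_with_leak = usage.copy()
usage_with_leak.iloc[-1] += 4000
leak_diff = usage_with_leak.diff(12).dropna()

print(normal_diff.abs().max())  # the repeating pattern cancels out
print(leak_diff.iloc[-1])       # the extra usage stands out
```

Once the yearly cycle is removed, a summer peak no longer looks unusual, while usage that breaks from the same month in prior years does.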
Making Predictions and Comparisons:
The program forecasts future water usage and compares it with actual data.
By analyzing past consumption, it can predict what typical usage should look like and flag any significant deviations.
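The full program leaves the comparison to the eye, via the chart. One way to flag deviations automatically is sketched below, assuming a reading counts as unusual when it exceeds the upper bound of the model's confidence interval. The helper `flag_unusual_usage` is hypothetical and not part of the original program, and the numbers are made up; the column name `upper gallons` matches what the program reads from `conf_int()`.

```python
import pandas as pd


def flag_unusual_usage(actual, conf_int, upper_col="upper gallons"):
    """Return the readings that exceed the upper confidence bound.

    Hypothetical helper: `actual` is a Series of gallons indexed by date,
    `conf_int` is a DataFrame of interval bounds on the same kind of index.
    """
    upper = conf_int[upper_col].reindex(actual.index)
    return actual[actual > upper]


# Made-up numbers for illustration only
dates = pd.to_datetime(["2023-01-31", "2023-02-28", "2023-03-31"])
actual = pd.Series([4000.0, 4100.0, 9500.0], index=dates, name="gallons")
conf_int = pd.DataFrame(
    {
        "lower gallons": [3000.0, 3100.0, 3200.0],
        "upper gallons": [5000.0, 5100.0, 5200.0],
    },
    index=dates,
)

flagged = flag_unusual_usage(actual, conf_int)
print(flagged)
```

With the variables from the full program, the call might look like `flag_unusual_usage(df['gallons'], in_sample_conf_int)`. Only the upper bound is checked here because a leak shows up as extra usage, not less.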
Visualizing the Data:
The real power of this program lies in its visualization capabilities.
Using Plotly, a versatile graphing library, the program generates an interactive chart. It not only shows actual water usage but also plots predicted values and their confidence intervals.
Highlighting Historical Data:
To provide context, the chart also includes historical data as reference points. These are shown as small horizontal lines, representing the same month in previous years.
Code (Google Colab)
!pip install plotly
!pip install statsmodels

import io

import pandas as pd
import plotly.graph_objects as go
from google.colab import files
from statsmodels.tsa.statespace.sarimax import SARIMAX

uploaded = files.upload()

# Use the name of the first uploaded file
filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))
df = df[['date', 'gallons']]

# Convert the date column to datetime, sort, and index by date
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='date', inplace=True)
df.set_index('date', inplace=True)

# Fill gaps on a daily grid, then keep the month-end values
df = df.asfreq('D')
df['gallons'] = df['gallons'].ffill()
df = df.asfreq('M')  # newer pandas releases spell this alias 'ME'

# SARIMA Model for Forecasting
model = SARIMAX(df['gallons'], order=(1, 0, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()

# In-sample predictions (start must be a single date, not the whole index)
in_sample_predictions = results.get_prediction(start=df.index[0],
                                               end=df.index[-1],
                                               dynamic=False)
predicted_mean_in_sample = in_sample_predictions.predicted_mean
in_sample_conf_int = in_sample_predictions.conf_int()

# Forecasting for future periods (e.g., the next 12 months)
forecast = results.get_forecast(steps=12)
predicted_mean_forecast = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()

# Prepare the figure
fig = go.Figure()

# Predicted data (in-sample) and confidence intervals
fig.add_trace(go.Scatter(x=predicted_mean_in_sample.index, y=predicted_mean_in_sample,
                         mode='lines', name='Predicted (In-Sample)',
                         line=dict(color='orange')))
fig.add_trace(go.Scatter(x=in_sample_conf_int.index, y=in_sample_conf_int['upper gallons'],
                         fill=None, mode='lines', line=dict(color='lightgray'),
                         showlegend=False))
fig.add_trace(go.Scatter(x=in_sample_conf_int.index, y=in_sample_conf_int['lower gallons'],
                         fill='tonexty', mode='lines', line=dict(color='lightgray'),
                         showlegend=False, name='Predicted CI'))

# Forecasted data (out-of-sample) and confidence intervals
fig.add_trace(go.Scatter(x=predicted_mean_forecast.index, y=predicted_mean_forecast,
                         mode='lines', name='Forecast (Out-of-Sample)',
                         line=dict(color='green')))
fig.add_trace(go.Scatter(x=forecast_conf_int.index, y=forecast_conf_int['upper gallons'],
                         fill=None, mode='lines', line=dict(color='lightgray'),
                         showlegend=False))
fig.add_trace(go.Scatter(x=forecast_conf_int.index, y=forecast_conf_int['lower gallons'],
                         fill='tonexty', mode='lines', line=dict(color='lightgray'),
                         showlegend=False, name='Forecast CI'))

# Actual data (make it bolder and on top)
fig.add_trace(go.Scatter(x=df.index, y=df['gallons'], mode='lines', name='Actual',
                         line=dict(color='blue', width=3)))

# Adding Previous Years' data as small horizontal lines
legend_added = False
for current_date in df.index.union(predicted_mean_forecast.index):
    current_month, current_year = current_date.month, current_date.year
    previous_years_data = df[(df.index.month == current_month) &
                             (df.index.year < current_year)]
    for prev_year_date in previous_years_data.index:
        y_value = previous_years_data.loc[prev_year_date, 'gallons']
        fig.add_shape(type="line",
                      x0=current_date - pd.Timedelta(days=5), y0=y_value,
                      x1=current_date + pd.Timedelta(days=5), y1=y_value,
                      line=dict(color="purple", width=2))
        # Shapes have no legend entry, so add one empty trace to stand in for them
        if not legend_added:
            fig.add_trace(go.Scatter(x=[None], y=[None], mode='lines',
                                     name='Previous Years',
                                     line=dict(color='purple', width=2)))
            legend_added = True

# Update layout
fig.update_layout(title='Actual vs Predicted vs Forecasted Water Usage',
                  xaxis_title='Date', yaxis_title='Gallons', hovermode='closest')

# Show the plot
fig.show()