100% ACCURACY Earthquake Type Prediction - Data Every Day #058

Gabriel Atkin · 33-minute read

A data set of significant earthquakes from 1965 to 2016 with 21 attributes is used to predict whether an earthquake's status is automatic or reviewed using a TensorFlow artificial neural network. The workflow covers importing standard libraries, data preprocessing, feature engineering, and model building, and the trained model achieves a perfect AUC of 1.0 on the test set, correctly classifying every example thanks to the combination of features rather than any single predictor.

Insights

  • Feature engineering plays a crucial role in preparing the earthquake data: new month and year columns are created from the date, the hour is extracted from the time column, and irrelevant columns are dropped to strengthen the model's predictive signal.
  • The TensorFlow artificial neural network shows how effective machine learning can be for this classification task: after extensive preprocessing, feature engineering, and training, it predicts earthquake status (automatic or reviewed) with an AUC of 1.0 on the test set.

Recent questions

  • How was the earthquake data preprocessed?

    The missing values were handled by imputing the mean for the "root mean square" column and dropping the rows with missing values in the "magnitude type" column. Feature engineering then created new month and year columns from the "date" column, converted them to integers, and extracted the hour from the "time" column as an integer.
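
A minimal pandas sketch of the missing-value handling described above; the CSV file name and the exact column labels ("Root Mean Square", "Magnitude Type") are assumptions about the raw dataset, not taken verbatim from the video:

```python
import pandas as pd

# Load the raw earthquake data (file name assumed).
df = pd.read_csv("earthquake_database.csv")

# Impute the column mean for missing "Root Mean Square" values.
df["Root Mean Square"] = df["Root Mean Square"].fillna(df["Root Mean Square"].mean())

# Drop the handful of rows with no "Magnitude Type" label.
df = df.dropna(subset=["Magnitude Type"])
```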

  • What libraries were used for data processing?

    The standard libraries used for data processing and visualization included numpy, pandas, matplotlib, and seaborn. These libraries were essential for importing, processing, and visualizing the earthquake data set to prepare it for model building and analysis.

  • How was the earthquake data visualized?

    The earthquake data was visualized using seaborn, with a heatmap of correlations and kernel density estimation plots for numeric columns. The heatmap displayed correlation values between different attributes, while the kernel density estimation plots helped in understanding the distributions of each feature in the data set.
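
A short seaborn sketch of those two visualizations, assuming df is the cleaned DataFrame from the preprocessing steps; overlaying every numeric column on one KDE figure is an assumption about how the distributions were compared:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap with annotated values and the full -1..1 range shown.
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1.0, vmax=1.0)
plt.show()

# Kernel density estimate of each numeric feature's distribution.
plt.figure(figsize=(12, 10))
for column in df.select_dtypes("number").columns:
    sns.kdeplot(df[column], label=column)
plt.legend()
plt.show()
```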

  • What encoding technique was applied to text columns?

    One-hot encoding was applied to text columns with more than two unique values. This technique created dummy columns for each unique value in the text columns, ensuring that the data was appropriately encoded for further processing and analysis.
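
A hedged sketch of that encoding step using pandas get_dummies; the helper name and the list of text columns are illustrative assumptions rather than the video's exact code:

```python
import pandas as pd

def onehot_encode(df, columns):
    """Replace each text column with one dummy (0/1) column per unique value."""
    df = df.copy()
    for column in columns:
        dummies = pd.get_dummies(df[column], prefix=column)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

# Text columns with more than two unique values (column names assumed).
df = onehot_encode(df, ["Type", "Magnitude Type", "Source", "Location Source", "Magnitude Source"])
```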

  • How was the earthquake prediction model evaluated?

    The model was compiled with the Adam optimizer and binary cross-entropy loss, and evaluated with the AUC metric, which captures performance across classes and classification thresholds. Training used a batch size of 32, 30 epochs, and a callback function to aid convergence, and the resulting model reached an AUC of 1.0 on the test set, correctly classifying all 7,022 test examples.
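
A minimal Keras sketch of this training setup, assuming model and the X_train/X_test/y_train/y_test splits come from the earlier preprocessing and model-building steps; the validation split and the choice of ReduceLROnPlateau as the convergence callback are assumptions:

```python
import tensorflow as tf

# Compile with Adam, binary cross-entropy, and AUC as the tracked metric.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)

# Train for 30 epochs with batches of 32; the callback lowers the learning
# rate when the validation loss plateaus (assumed choice of callback).
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=30,
    callbacks=[tf.keras.callbacks.ReduceLROnPlateau()],
)

# Report loss and AUC on the held-out test set.
print(model.evaluate(X_test, y_test))
```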

Summary

00:00

Predicting Earthquake Classification with TensorFlow Neural Network

  • Data set of significant earthquakes from 1965 to 2016 with 21 attributes, including a "status" column indicating automatic or reviewed classification.
  • Objective: predict whether an earthquake is automatic or reviewed using a TensorFlow artificial neural network.
  • Importing standard libraries like numpy, pandas, matplotlib, and seaborn for processing and visualization.
  • Preprocessing data using StandardScaler and train_test_split functions from sklearn.
  • Building the model with TensorFlow after loading data using pandas read_csv function.
  • Initial data check reveals 23,000 examples with 21 columns, some containing missing values.
  • Identifying columns with more than 66% missing values to drop them.
  • Handling missing values by imputing the mean for the "root mean square" column and dropping rows with missing values in the "magnitude type" column.
  • Feature engineering by creating new columns for month and year from the "date" column and converting them to integers.
  • Further feature engineering by extracting the hour from the "time" column and converting it to an integer, then dropping the original "time" and "date" columns (see the date/time sketch after this list).
  • Visualizing the data with seaborn, creating a heatmap of correlations and kernel density estimation plots for numeric columns.
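
A small pandas sketch of the date and time feature engineering referenced above, assuming df is the DataFrame loaded earlier, the "Date" column looks like MM/DD/YYYY, and the "Time" column looks like HH:MM:SS (a few rows of the raw dataset may use other formats and would need extra handling):

```python
# Month and year as integer features from the "Date" column.
df["Month"] = df["Date"].str.split("/").str[0].astype(int)
df["Year"] = df["Date"].str.split("/").str[2].astype(int)

# Hour of day as an integer feature from the "Time" column.
df["Hour"] = df["Time"].str.split(":").str[0].astype(int)

# The original text columns are no longer needed.
df = df.drop(["Date", "Time"], axis=1)
```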

19:27

Visualizing Data with Heat Maps and TensorFlow

  • Plotting the data as a heat map involves setting up a new matplotlib figure with a size of 12x10 and using sns.heatmap to display correlation values.
  • Annotations are turned on to show correlation values on the squares, and the min and max values are set to display the full range of correlations.
  • The status column, which is the target for prediction, contains text values that need to be converted to numerical values (0 or 1) for visualization.
  • The year extracted from the data shows a high correlation (0.58) with the status column, indicating that as the year increases, the number of reviewed cases also increases.
  • Kernel density estimation plots of each feature are suggested to understand their distributions, with data temporarily scaled for consistency.
  • Standardizing the data is necessary to view all features on the same scale, using the StandardScaler from sklearn to center them at mean zero and give them unit variance.
  • One-hot encoding is applied to text columns with more than two unique values, creating dummy columns for each unique value.
  • Checking for identical columns like location source and magnitude source is done to ensure data integrity before encoding.
  • The data is encoded using one-hot encoding for text columns, resulting in 105 columns with encoded data ready for further processing.
  • The data is split into training and testing sets, scaled using StandardScaler, and a TensorFlow model is constructed with input, hidden, and output layers for classification, as sketched below.
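
A hedged sketch of the split, scaling, and model construction, assuming the encoded DataFrame df with a "Status" target already mapped to 0/1 as described above; the 70/30 split, the random seed, and the two 64-unit hidden layers are assumptions:

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the binary target (0 = automatic, 1 = reviewed) from the features.
y = df["Status"]
X = df.drop("Status", axis=1)

# Hold out a test set, then standardize both splits using statistics
# learned from the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Feed-forward classifier: input sized to the encoded columns,
# two hidden layers, and a sigmoid output for the binary label.
inputs = tf.keras.Input(shape=(X_train.shape[1],))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```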

39:38

"Optimized model achieves perfect classification accuracy"

  • The model is compiled using an Adam optimizer, binary cross entropy as the loss function, and the AUC metric instead of accuracy to evaluate performance across classes and classification thresholds.
  • The model is trained with a batch size of 32, 30 epochs, and a callback function to help convergence, with the fit history stored in a variable.
  • After training, a plot of the loss and AUC over time shows that the model performed exceptionally well, reaching an AUC of 1.0 on the test set and correctly classifying all 7,022 examples thanks to the combination of features rather than any single predictor (see the plotting sketch below).
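
A short matplotlib sketch of that history plot, assuming history is the object returned by model.fit, the AUC metric was named "auc" at compile time, and a validation split was used (so the val_* keys exist):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Per-epoch metrics recorded by model.fit.
metrics = pd.DataFrame(history.history)

fig, (ax_loss, ax_auc) = plt.subplots(1, 2, figsize=(12, 4))
metrics[["loss", "val_loss"]].plot(ax=ax_loss, title="Binary cross-entropy loss")
metrics[["auc", "val_auc"]].plot(ax=ax_auc, title="AUC")
ax_loss.set_xlabel("Epoch")
ax_auc.set_xlabel("Epoch")
plt.show()
```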