Kaggle – Abdominal Trauma prediction – early experiences and errors

I recently jumped into the Radiological Society of North America's (RSNA) abdominal trauma prediction Kaggle competition and will share a few early challenges and thoughts.

I am surprised there are fewer than 1,000 competitors in a free classification competition using computed tomography (CT) images. I found it an immediately interesting opportunity. The available Python notebooks make getting started quite easy, although I quickly ran into some challenges, which I discuss below.

Analyze CT images using AI with plenty of community help

I quickly encountered an error when I began working with an available Keras notebook. As I have seen in all Kaggle competitions, there is an active, helpful community. One member provided the solution for an error encountered with the first record in the test data (patient 48843).

This line, which queries the dataframe for a particular patient inside a for loop:

patient_df = test_df.query("patient_id == @patient_id")  # raised an error on the test data

had to become:

patient_df = test_df[test_df["patient_id"] == patient_id]  # plain boolean indexing works

I am still much stronger in R than Python. The solution reminded me of this code using base R:

# Query the dataframe for a particular patient
patient_df <- test_df[test_df$patient_id == patient_id,]

Studying the use of enumerate in the surrounding Python code block taught me that R offers similar functionality through indexed references, which I have used often.
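
As a minimal sketch of that parallel (the patient IDs below are illustrative, not pulled from the competition code), Python's enumerate yields index/value pairs, and seq_along gives you the same pattern in R:

# Python's enumerate(patient_ids) yields (index, value) pairs;
# seq_along() supplies the indices for the same pattern in R.
patient_ids <- c(48843, 50123, 51729)  # illustrative IDs

for (i in seq_along(patient_ids)) {
  patient_id <- patient_ids[i]
  cat(sprintf("Processing patient %d of %d: %s\n",
              i, length(patient_ids), patient_id))
}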

I look forward to spending more time with the Keras (neural network) prediction. Lower scores are better in this competition, and the 12.16 it yielded was far worse than the 0.66 scored by simply using weighted means, as Vishak K Bhat did. Also surprising is that both methods yielded constant predictions.
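
For intuition on why lower is better: the competition metric is a log-loss variant. Here is a minimal sketch of plain, unweighted log loss in R (the actual metric applies per-category weights that I am not reproducing here):

# Plain log loss for binary labels: lower is better
log_loss <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)  # clip to avoid log(0)
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

# Confident wrong answers are punished much harder than hedged ones
log_loss(c(1, 0, 1), c(0.9, 0.1, 0.8))  # ~0.14
log_loss(c(1, 0, 1), c(0.1, 0.9, 0.2))  # ~2.07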

I thought to average the two and then adjust the weighting between them to see how that affected the score. And that is where I am stuck for now, because my R notebook is yielding a submission scoring error. I used R because I know how to quickly calculate the mean of two data frames' predictions: add the data frames together, then divide by the number of data frames (2).
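
A minimal sketch of that blending idea (the file names are illustrative, and I am assuming both files share a patient_id column and the same row order):

# Read the two submissions (file names are illustrative)
keras_sub <- read.csv("keras_submission.csv")
means_sub <- read.csv("weighted_means_submission.csv")

# Average every prediction column; leave patient_id untouched
pred_cols <- setdiff(names(keras_sub), "patient_id")
blend <- keras_sub
blend[pred_cols] <- (keras_sub[pred_cols] + means_sub[pred_cols]) / 2

# Or shift the weighting, e.g. 25% Keras / 75% weighted means
w <- 0.25
blend[pred_cols] <- w * keras_sub[pred_cols] + (1 - w) * means_sub[pred_cols]

write.csv(blend, "submission.csv", row.names = FALSE)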

Remember that we have only discussed constant predictions so far (the same prediction for every patient in each column). You will also note there are only 3 patients, even though the training dataset has 3,147 patients. The final test data will have ~1,100 patients, and we are currently scored (the "public score") on ~36% of the test data. So much training data (460 GB!), yet until the competition ends we are scored on a 3-row file that is less than 1 KB. This makes me laugh, and not in a mocking way. Part of the fun of Kaggle competitions, and data science in general, is mining for meaning despite the limitations. Cross-validation will obviously be valuable.

Some concepts I will be exploring in the month of competition that remains:

  • Cross-validation and its impact on the public score (somehow the 3-row file is mathematically processed to yield predictions for ~36% of the test data)
  • These are 3D images and there’s talk of a useful 2.5D method.
  • Image data (the DICOM tags are in a parquet file, and there are NIfTI files; see the sketch after this list)
  • Espadon: a new R package for DICOM files
    • This will likely mean seeing if I can get an R package added to a Kaggle notebook. These files are too large for my computer and the Espadon paper speaks of huge processing needs.
  • Anomaly detection using local author Kevin Feasel's Finding Ghosts in Your Data (2023), which is on sale (unlike when I purchased it, haha)
    • I would like to see if a technique in that book can catch the corrupt image that I believe does not otherwise cause an error
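
On the image-data bullet, reading the DICOM tag parquet into R should be straightforward with the arrow package. A minimal sketch, assuming the input path and file name below match the competition's data listing:

library(arrow)  # install.packages("arrow") first in a Kaggle R notebook, if needed

# Path and file name are my assumption of the competition's data layout
tags_path <- "/kaggle/input/rsna-2023-abdominal-trauma-detection/train_dicom_tags.parquet"
dicom_tags <- read_parquet(tags_path)

dim(dicom_tags)          # how many rows and tag columns?
head(names(dicom_tags))  # peek at the available DICOM tag fields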

Speaking of Kevin Feasel, I am scheduled to speak about my Kaggle data science experiences, including the current RSNA Abdominal Trauma prediction competition, on October 24th for the Triangle SQL Server User Group he leads (I’m a board member)!
