Your job will be to predict movie revenue from a few different variables, such as the movie’s budget, director and more! This is a great opportunity to gain hands-on experience working on a data science project as well as to win some *seriously cool* prizes. We’re going to tell you about those first to get you excited!


Competition Details

Participation and Timeline

You can participate individually or in a team of two (and yes, on a team of two, each member wins the prize!). Please fill out the following Google form before downloading the competition data: Link to Form!

The competition will run until May 1.

Task and Dataset

We’re giving you a dataset of over 3,000 movies with a bunch of information (“features”) about each movie. Your task is to predict the movie’s revenue (in $) from these features. Concretely, here are some of the features we’ll be providing you:

  • Movie Budget
  • Genre(s)
  • Cast
  • Crew
  • Production company
  • Tagline for the movie (if one exists)

We’ve divided the dataset into two splits: train and test. For the train split, you’ll be provided the actual revenue values for the movies, which you can use to train your machine learning model. For the test split, we give you all the features but not the revenue. You’ll then submit your revenue predictions to our system and be evaluated accordingly.

    The dataset splits can be found here: Link to dataset!
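If you are unsure where to start, here is a minimal baseline sketch of the train-then-predict workflow described above. It is only an illustration, not an official or recommended approach: it uses the budget as its single feature, takes plain arrays rather than the real dataset files (whose column names are not listed here), and the function names `fit_log_baseline` and `predict_revenue` as well as the toy numbers are made up for this example.

```python
import numpy as np


def fit_log_baseline(budgets, revenues):
    """Least-squares fit of log1p(revenue) = slope * log1p(budget) + intercept."""
    log_budget = np.log1p(np.asarray(budgets, dtype=float))
    log_revenue = np.log1p(np.asarray(revenues, dtype=float))
    slope, intercept = np.polyfit(log_budget, log_revenue, deg=1)
    return float(slope), float(intercept)


def predict_revenue(model, budgets):
    """Predict revenue in dollars from budgets using a fitted (slope, intercept)."""
    slope, intercept = model
    log_budget = np.log1p(np.asarray(budgets, dtype=float))
    # expm1 undoes log1p; clip at zero so no prediction is negative.
    return np.maximum(np.expm1(slope * log_budget + intercept), 0.0)


# Made-up toy numbers, purely for illustration (not from the competition data).
train_budgets = [1e6, 5e6, 2e7, 1e8]
train_revenues = [2e6, 1.2e7, 6e7, 2.5e8]

model = fit_log_baseline(train_budgets, train_revenues)   # "train" split
test_predictions = predict_revenue(model, [1e7, 5e7])     # "test" split
```

Fitting in log space is a design choice for this sketch: revenues span several orders of magnitude, so a fit on raw dollars would be dominated by the biggest blockbusters.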

Submission and Evaluation

    Each movie in the dataset is associated with an ID. We ask you to submit your predictions for the test set in a CSV file with the format:

                MovieID, prediction
                MovieID, prediction
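As a rough sketch, a file in this shape could be written with Python's standard `csv` module. The helper name `write_submission` is hypothetical, and the sketch assumes one `MovieID, prediction` row per test movie with no header row and no space after the comma, since neither detail is spelled out here; please check the competition site for the exact requirements.

```python
import csv


def write_submission(path, movie_ids, predictions):
    """Write one 'MovieID,prediction' row per test movie (no header row assumed)."""
    movie_ids = list(movie_ids)
    predictions = list(predictions)
    if len(movie_ids) != len(predictions):
        raise ValueError("need exactly one prediction per movie ID")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for movie_id, prediction in zip(movie_ids, predictions):
            writer.writerow([movie_id, float(prediction)])
```

For example, `write_submission("submission.csv", test_ids, test_predictions)` would write the file, where `test_ids` and `test_predictions` are placeholders for your own lists.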
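    We’ll be evaluating your predictions using RMSLE, or Root Mean Squared Log Error. For each movie, this metric takes the difference between the log of your prediction plus one and the log of the actual value plus one, then squares it. It then averages these squared differences over all n movies and takes the square root. Here’s the equation for RMSLE, where x_i is your prediction and y_i is the actual value for movie i: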
    $$ \mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left(\log(x_i+1) - \log(y_i+1)\right)^2}} $$
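To check your model locally before submitting, the formula above can be sketched in a few lines of NumPy. This is our reading of the equation, not the scoring code that runs on the backend, and the function name `rmsle` is just a label for this example.

```python
import numpy as np


def rmsle(predictions, actuals):
    """Root mean squared log error between predicted and actual revenues."""
    x = np.asarray(predictions, dtype=float)
    y = np.asarray(actuals, dtype=float)
    if x.shape != y.shape:
        raise ValueError("predictions and actuals must have the same shape")
    if (x < 0).any() or (y < 0).any():
        raise ValueError("revenues must be non-negative")
    # log1p(v) computes log(v + 1), matching the +1 terms in the formula.
    return float(np.sqrt(np.mean((np.log1p(x) - np.log1p(y)) ** 2)))
```

Because the metric works on logs, it measures relative error: predicting $2 million for a movie that made $1 million costs roughly the same as predicting $200 million for one that made $100 million.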

    This next part is very important. On the backend, we’ve divided the test set into a public and a private test set. When you submit your predictions for the test set, you can submit to either the public or the private leaderboard. Submitting to the public leaderboard lets you see how you measure up against other participants, and you can submit up to ten times per day. However, your final score will be evaluated on the private leaderboard, for which you have only three submissions in total. Do not forget to submit to the private leaderboard, but also use your submissions sparingly! The reason we do this is so that participants can’t “overfit” to the public leaderboard by constantly tweaking the parameters of their model and resubmitting.

    More details can be found on the EvalAI Competition site (see link above).


    Joyce Luo

    Ellie Bae

    Kevin Huang

    Nabhonil Kar

    Sahil Jain

    Neil Hazra

    Sahil Ambardekar

    Arjun Mani

Contact Us

    Please note that Princeton Data Science is a student organization and not a university department. Consequently, we do not sponsor PhD or Master’s students.