Great to see.

Wanted to provide a heads up -- and thanks to Lee for the reminder -- that there are a few different team abbreviations used in the provided data.

- `FieldPosition` and `PossessionTeam` use ARZ, BLT, CLV, HST
- `VisitorTeamAbbr` and `HomeTeamAbbr` use ARI, BAL, CLE, HOU

This may or may not impact your approach -- in my post here, where I standardize the field so that each possession team is working left to right, manually tweaking these names is helpful.
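For anyone who wants to do that tweak in code, here is a minimal sketch of one way to line the two naming schemes up. The dictionary and helper names (`ABBR_FIXES`, `standardize_abbr`, `possession_is_home`) are made up for illustration, and the sample play is invented.

```python
# Hypothetical helper: map the codes used in PossessionTeam / FieldPosition
# onto the codes used in HomeTeamAbbr / VisitorTeamAbbr.
ABBR_FIXES = {"ARZ": "ARI", "BLT": "BAL", "CLV": "CLE", "HST": "HOU"}


def standardize_abbr(abbr):
    """Return the HomeTeamAbbr-style code; other codes pass through unchanged."""
    return ABBR_FIXES.get(abbr, abbr)


def possession_is_home(play):
    """True when the possession team is the home team, after fixing the code."""
    return standardize_abbr(play["PossessionTeam"]) == play["HomeTeamAbbr"]


# Invented example row, just to show the call pattern.
play = {
    "PossessionTeam": "BLT",
    "FieldPosition": "CLV",
    "HomeTeamAbbr": "BAL",
    "VisitorTeamAbbr": "CLE",
}
print(standardize_abbr(play["FieldPosition"]))
print(possession_is_home(play))
```

Without the mapping, a direct comparison of `PossessionTeam` and `HomeTeamAbbr` would silently fail for those four teams.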

**Oct 19/Oct 28 update**

A few additional notes that keep coming up:

- The `Orientation` variable from 2017 is not reliable. The 2019 and 2018 seasons should look similar to each other, though.
- Speed (`S`) in 2019 is most similar to 2018. In 2017, slightly different RFID tags were used for the player tracking data.
- The data schema figure is updated to accurately reflect the `Orientation` and `Dir` directions. Hopefully the approach here helps account for player direction, and you can read more at: https://www.kaggle.com/statsbymichaellopez/nfl-tracking-initial-wrangling-voronoi-areas

That said, *Please use this thread* to raise any questions about the data or any variables. Happy to answer below!

1) The NFL's player tracking data contains the x and y coordinates for each player and the football, collected at roughly 10 frames-per-second. Locational information is provided by signals sent from radio-frequency identification (RFID) chips that are placed inside each player's shoulder pads and inside the football. Speed, orientation, and distance traveled are straightforward to calculate using the tracking information -- in this contest, you're only receiving this tracking information at the moment a handoff is made.

2) The field coordinates are fixed at each NFL stadium. Often, the first step in any analysis of tracking data is to ensure offensive teams are moving in the same direction. This requires flipping roughly half of a game's offensive plays from one direction to the other (use the `PlayDirection` variable), while creating new x and y coordinates, as well as new player orientations and directions. Additionally, standardizing by the play's line of scrimmage may be warranted (particularly when thinking about player space). For a sample Kernel using the competition's primary data source, see my Kernel here: https://www.kaggle.com/statsbymichaellopez/nfl-tracking-initial-wrangling-voronoi-areas
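As a rough sketch of that flipping step (not the code from the linked Kernel): the version below assumes a 120-yard field length including end zones, a width of 160/3 yards, `PlayDirection` values of `"left"` and `"right"`, and angles in degrees. The `_std` column names and the `standardize_player` helper are hypothetical.

```python
FIELD_LENGTH = 120.0     # yards, including both end zones (assumed)
FIELD_WIDTH = 160.0 / 3  # about 53.3 yards (assumed)


def standardize_player(row):
    """Add *_std fields so that every play runs left to right.

    Plays already moving right are copied through; plays moving left are
    rotated 180 degrees around the centre of the field.
    """
    out = dict(row)
    if row["PlayDirection"] == "left":
        out["X_std"] = FIELD_LENGTH - row["X"]
        out["Y_std"] = FIELD_WIDTH - row["Y"]
        out["Dir_std"] = (row["Dir"] + 180.0) % 360.0
        out["Orientation_std"] = (row["Orientation"] + 180.0) % 360.0
    else:
        out["X_std"] = row["X"]
        out["Y_std"] = row["Y"]
        out["Dir_std"] = row["Dir"]
        out["Orientation_std"] = row["Orientation"]
    return out


# Invented tracking row for illustration.
sample = {"PlayDirection": "left", "X": 30.0, "Y": 10.0, "Dir": 90.0, "Orientation": 270.0}
flipped = standardize_player(sample)
print(flipped["X_std"], flipped["Dir_std"])
```

Rotating both coordinates (rather than only mirroring x) keeps left and right sides of the formation consistent after the flip.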

3) Given updates to the RFID tags prior to the start of each season, small differences in speed measurements may exist from one year to the next. The 2019 data looks more similar to the 2018 data than to the 2017 data, for example. That said, tracking data is considered quite dependable; according to the Next Gen Stats group, location information is accurate to within +/- 12 inches, and reliable data has been collected on 99.999% of the entirety of players and games over the last two seasons.

4) One of the most crucial features for analyzing run play success is likely the amount of space that the ball carrier had available when he received the ball. Curious what space means in the context of a football game? Compare the two plays I shared on social media.

- Play A: Zeke Elliott picks up 11 yards (link)
- Play B: Cordarrelle Patterson (link)

In the first example, Zeke had a bunch of space in front of him -- even more so when considering how fast he was moving when he got the ball. In the second, the play was doomed from the start. Defining what player space looks like in the context of football (and player positions, angles, and size) will go a long way towards improving your predictions.

5) Two places to look for ways of getting started:

- Last year's Big Data Bowl finalist papers are up at https://operations.nfl.com/the-game/big-data-bowl/2019-big-data-bowl/
- Notebooks from the NFL Punt Analytics Competition, albeit with slightly different variable names, are up at https://www.kaggle.com/c/NFL-Punt-Analytics-Competition/notebooks

The NFL Big Data Bowl competition has closed! We hope you were able to learn a lot, and use your machine learning skills on this interesting problem. Last year's Big Data Bowl (focused on pass plays) had 125 participants - this year had 2,175 participants on over 1,800 teams. We had 32,000 submissions from 75 countries! For 460 users (including 14 in the top 100!), this was their first competition. We also had 71 new Masters or Experts at the conclusion of this competition. Thank you all for your hard work in this competition and congratulations to our winners and to those who gained a new ranking!

We were excited to work with all the folks at the NFL like Mike Lopez (winner of previous Kaggle competitions) and Jay Rogers. If you aren't following Mike on Twitter, he's done a great job of documenting the progress of the competition, as well as just producing really interesting sports-related content. We're glad they chose Kaggle to be a platform to bring these types of problems to the broader data science community. They've been a pleasure to work with, and we hope to work with them again in the future.

And if you haven't already, be sure to check out the article in the Wall Street Journal, as well as the Moneyball podcast, both of which mentioned the competition.

We're also continually pleased to see how Kaggle Competitions serve as a great medium and vehicle through which the greater data science community can learn and develop their machine learning abilities and skills. Beginners and experts can come together to start, grow, and succeed. Further, in this competition, many can use their machine learning skills to help a great cause like this one.

We've cleaned the leaderboard and disqualified some teams that have violated the rules. If you think you were removed by mistake, or believe you have evidence that suggests another team cheated, please contact compliance. Please fill in all the fields honestly.

The top potential winning teams have been contacted via email to provide their winning solutions for host review. Should we need to move down the leaderboard for any reason, we will do so and continue to make contact with subsequent teams.

You should have already seen your points and medals awarded; however, it may take a few hours to propagate through if yours are still missing.

In the meantime, I've created this thread to be a curated list of your work. As you post your solutions, top kernels, and general discussion posts, I'll update this thread to keep track. If you write a paper, thesis, or present at a conference, and would like to share your work - please let us know so we can share with the Kaggle community! Additionally, if you performed any data manipulation, or maybe used a new technique in this competition that you'd like to share to further the industry, please let us know and we'll post it here! That way, should you be looking for takeaways in the future, or if you've just now stumbled across the competition, you have a place to see what surfaced from the work performed.

I was particularly a fan of this tweet thread by 903124, which combined this year's analysis with last year's Punt Prediction Challenge. Pretty neat!

Happy Modeling! Kaggle Team

- 1st Place Solution
- 2nd Place Solution
- 3rd Place Solution
- 4th Place Solution
- 5th Place Solution
- 6th Place Solution
- 8th Place Solution
- 9th Place Solution
- 10th Place Solution
- 12th Place Solution
- 13th Place Solution
- 14th Place Solution
- 16th Place Solution
- 17th Place Solution
- 18th Place Solution
- 21st Place Solution
- 22nd Place Solution
- 23rd Place Solution
- 26th Place Solution
- 27th Place Solution
- 30th Place Solution
- 33rd Place Solution
- 34th Place Solution
- 42nd Place Solution
- 49th Place Solution

- Neural Networks Feature Engineering For the Win
- Initial Wrangling Voronoi Areas in Python
- Next Gen EDA
- NFL Tracking, Wrangling, Voronoi and Sonars
- NFL Simple Model using LightGBM
- Comprehensive Cleaning and EDA
- Plotting Player Position
- Cox Proportional Hazard Model
- Neural Network with MAE Objective
- NFL Big Data Visualization

How much better can we do? You tell us. This challenge will test your data science acumen, but more importantly, will involve loads of football-specific expertise. We encourage football fans to partner with data experts -- and vice versa -- and collaborate to identify what the most important features are that determine run play outcomes. The culmination of the contest is a live leaderboard that will function during the last 5 weeks of the 2019 NFL regular season (roughly, the month of December). You'll literally be predicting plays that haven't yet played out on the field.

Good luck!

If you would consider yourself a beginner but don't know where to get started, let other Kagglers help you take your first steps here!

New to Kaggle? Take a look at a few videos our very own Dr. Rachael Tatman has put together to learn a bit more about site etiquette, Kaggle lingo, and how to enter a competition using Kaggle Notebooks.

Given the complexity of this year's theme, we wanted a more open-ended and accessible version to be available to newer analysts who still wanted to play and learn from NFL player tracking data. Thus, our college subcontest. Any collegiate student (US and Canada only; apologies to all our international students) is eligible for the subcontest, and the theme is more open ended. Anything related to run plays, using any data set, will work.

If you're interested in learning more, please visit our subcontest page, up at https://operations.nfl.com/the-game/big-data-bowl/terms-and-conditions/

Thanks for checking us out!

We know many of you are eagerly awaiting the first rescore. I wanted to share that we started reruns today, noticed an unusually high error rate, investigated, and realized we had accidentally omitted the milliseconds component of timestamps in the new dataset. We will restart the reruns and ideally have something to post tomorrow.

Thanks for your patience!

We want to sincerely thank the hosts and Kaggle for making this competition possible. We had a lot of fun crafting our solution as it was necessary to think a bit out of the box and come up with something that really reflects the situation on the field. An extra thanks goes to Michael Lopez for actively participating in all the discussions and activities around the competition. That did add motivation to improve and believe that we can bring some value to NFL analytics. Can't remember the last time we've seen such involvement of a host in a competition.

There was little problem with the data (2017 measurement differences were disclosed) and there was a nice correlation between CV and public LB. There was also no real chance to cheat as private LB will be on future data. We also want to thank all competitors for not exploiting the possible leak in public LB.

We really hope there won't be any surprises on the private LB data and we hope our kernels will run through. In these types of kernel competitions there is always the risk of something failing, which would be devastating, of course.

Regardless of what happens, we are really proud of our solution and strongly believe that it can be a valuable asset to future endeavors in NFL analytics.

**TL;DR:** It's a 2d CNN based on relative location and speed features only.

A few words about how we came up with the model structure. To simplify, we assume a rushing play consists of:

- A rusher, whose aim is to run forward as far as possible
- 11 defense players who are trying to stop the rusher
- 10 remaining offense players trying to prevent defenders from blocking or tackling the rusher

This description already implies which players are important and which might be irrelevant; later we confirmed this on CV and LB. Here is an example of the play visualization we used (based on the modified kernel from Rob Mulla [1])

If we focus on the rusher and remove other offense team players, it looks like a simple game where one player tries to run away and 11 others try to catch him. We assume that as soon as the rushing play starts, every defender regardless of the position, will focus on stopping the rusher asap and every defender has a chance to do it. The chances of a defender to tackle the rusher (as well as estimated location of the tackle) depend on their relative location, speed and direction of movements.

Another important rule we followed was not to order the players, because that would force an arbitrary criterion into the model, which will not be optimal. Besides, the picture from above gives us reason to believe each defender should be treated in a similar manner.

That points to the idea of a convolution over individual defenders using relative locations and speeds, and then applying pooling on top.

At first we literally ignored the data about 10 offense players and built a model around the rusher and defenders, which was already enough to get close to 0.013 on public LB. Probably with proper tuning one can even go below 0.013.

To include the offense team players we followed the same logic - these 10 players will try to block or tackle any of the defenders if there is a risk of getting the rusher stopped. So, to assess the position of a defender we want to go through all the offense team players, use their location and speed relative to the defender, and then aggregate. To do so, we apply convolution and pooling again. So good old convolution - activation - pooling is all we needed.

The logic from above brought us to the idea of reshaping the data of a play into a tensor of defense vs offense, using features as channels to apply 2d operations.

There are 5 vector features which were important (so 10 numeric features if you count projections on the X and Y axes). We added a few more, but their contribution is insignificant. The vectors are relative locations and speeds, so to derive them we used only the "X", "Y", "S" and "Dir" variables from the data. Nothing else is really important, not even wind direction or birthday of a player ;-)
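To make the defense vs offense tensor idea concrete, here is a small pure-Python sketch. The exact choice of the five vectors below is an assumption made for illustration (the text only says they are relative locations and speeds), as is the convention that `Dir` is in degrees measured clockwise from the +Y axis. The helper names `velocity` and `play_tensor` are hypothetical.

```python
import math


def velocity(player):
    """Velocity vector from S and Dir (assumes degrees clockwise from +Y)."""
    rad = math.radians(player["Dir"])
    return player["S"] * math.sin(rad), player["S"] * math.cos(rad)


def play_tensor(rusher, defenders, offense):
    """Return nested lists shaped [n_defenders][n_offense][10 channels]."""
    rvx, rvy = velocity(rusher)
    tensor = []
    for d in defenders:
        dvx, dvy = velocity(d)
        # Three vectors that depend only on the defender and the rusher.
        per_defender = [
            dvx, dvy,                              # defender velocity
            d["X"] - rusher["X"], d["Y"] - rusher["Y"],  # position vs rusher
            dvx - rvx, dvy - rvy,                  # velocity vs rusher
        ]
        row = []
        for o in offense:
            ovx, ovy = velocity(o)
            # Two vectors that depend on the defender-offense pair.
            pair = [o["X"] - d["X"], o["Y"] - d["Y"], ovx - dvx, ovy - dvy]
            row.append(per_defender + pair)
        tensor.append(row)
    return tensor


# Invented players, only to show the resulting shape.
rusher = {"X": 30.0, "Y": 26.0, "S": 5.0, "Dir": 90.0}
defenders = [{"X": 35.0 + i, "Y": 20.0 + i, "S": 3.0, "Dir": 270.0} for i in range(11)]
offense = [{"X": 31.0 + i, "Y": 22.0 + i, "S": 2.0, "Dir": 90.0} for i in range(10)]
t = play_tensor(rusher, defenders, offense)
print(len(t), len(t[0]), len(t[0][0]))
```

Because no player ordering is meaningful here, any model consuming this tensor should pool over both player axes rather than rely on index positions.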

The simplified NN structure looks like this:

So the first block of convolutions learns to work with defense-offense pairs of players, using geometric features relative to rusher. The combination of multiple layers and activations before pooling was important to capture the trends properly. The second block of convolutions learns the necessary information per defense player before the aggregation. And the third block simply consists of dense layers and the usual things around them. 3 out of 5 input vectors do not depend on the offense player, hence they are constant across the "off" dimension of the tensor.

For pooling we use a weighted sum between both average and max pooling with average pooling being more important (roughly 0.7). In earlier stages of the model, we had different kinds of activations (such as ELU) as they don't threshold the negative weights which can be problematic for the pooling, but after tuning we could switch to ReLU which is faster and had similar performance. We directly optimize CRPS metric including softmax and cumsum.
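The two ingredients in that paragraph can be sketched in plain Python as follows. This is only an illustration, not the team's PyTorch code: it assumes 199 one-yard bins running from -99 to 99 for the CRPS calculation, and the function names (`mixed_pool`, `softmax`, `crps`) are made up.

```python
import math


def mixed_pool(values, avg_weight=0.7):
    """Weighted sum of average pooling and max pooling over one axis."""
    avg = sum(values) / len(values)
    return avg_weight * avg + (1.0 - avg_weight) * max(values)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def crps(logits, yards, min_yards=-99):
    """CRPS for one play: softmax -> cumulative sum -> squared error vs step."""
    cdf, running = [], 0.0
    for p in softmax(logits):
        running += p
        cdf.append(running)
    # The target CDF is a step function that jumps to 1 at the true gain.
    errors = [
        (c - (1.0 if min_yards + i >= yards else 0.0)) ** 2
        for i, c in enumerate(cdf)
    ]
    return sum(errors) / len(errors)


uniform = [0.0] * 199
peaked = [-50.0] * 199
peaked[99 + 4] = 50.0  # nearly all mass on a 4-yard gain
print(crps(peaked, 4) < crps(uniform, 4))
```

Since softmax and the cumulative sum are both differentiable, the same chain can serve directly as a training loss in an autograd framework.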

For fitting, we use Adam optimizer with a one cycle scheduler over a total of 50 epochs for each fit with lower lr being 0.0005 and upper lr being 0.001 and 64 batch size. We tried tons of other optimizers, but plain Adam is what worked best for us.

We were quite fortunate to discover a really robust CV setup. Probably, we will never have such a nice CV again. In the end, it is quite simple. We do 5-fold GroupKFold on GameId, but in validation folds we only consider data from 2018 (similar to how Patrick Yam did it [2]). We saw very strong correlations between that CV and public LB as 2019 data is way more similar to 2018 data compared to 2017 data. Having the 2017 data in training is still quite crucial though. As we are using bagging on our final sub, we also bagged each fold 4 times for our CV, meaning our final CV is a 5-fold with each fold having 4 bags with random seeds.
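Here is a rough stand-in for that split, written without scikit-learn so it runs anywhere. The round-robin game assignment replaces `GroupKFold`, and the `Season` field plus the `grouped_folds` helper are assumptions for illustration.

```python
def grouped_folds(rows, n_folds=5, valid_season=2018):
    """Yield (train_idx, valid_idx) pairs grouped by GameId.

    All plays of a game land in the same fold, and the validation part of
    each fold keeps only plays from `valid_season`.
    """
    games = sorted({r["GameId"] for r in rows})
    fold_of = {g: i % n_folds for i, g in enumerate(games)}  # round-robin
    for k in range(n_folds):
        train = [i for i, r in enumerate(rows) if fold_of[r["GameId"]] != k]
        valid = [
            i for i, r in enumerate(rows)
            if fold_of[r["GameId"]] == k and r["Season"] == valid_season
        ]
        yield train, valid


# Invented data: 10 games per season, 2 plays per game.
rows = [
    {"GameId": g, "Season": 2017 if g < 10 else 2018}
    for g in range(20) for _ in range(2)
]
for train_idx, valid_idx in grouped_folds(rows):
    print(len(train_idx), len(valid_idx))
```

Note that 2017 plays from the held-out games are simply unused in that fold; they still contribute to training in the other four.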

Having such a strong CV setup meant that we did not always need to check public LB and we were quite confident in boosts on CV. We actually had quite a long period of not submitting to public LB and our improvements were all gradual. Based on given correlation, we could always estimate the rough LB score. You can see a plot of some of our CV and LB models below. The x-axis depicts the CV score, and y-axis respective LB score. Blue dots are models actually submitted to LB, and red dots are estimates. You can see that we lost the correlation only a tiny bit in the end, and our theoretical public LB score would have been below 0.01200. Our final CV for 2018 is around 0.012150.

As we assume most people did, we adjusted the data to always be from left to right. Additionally, for training we clip the target to -30 and 50. For X,Y and Dir there is no other adjustment necessary, however, as most have noted, there are some issues with S and A. Apparently, the time frames were slightly different between different plays.

For S, the best adjustment we found is to simply replace it with Dis * 10. A is a bit more tricky as there is apparently some form of leak in 2017 data (check the correlation between rusher A and target). So what we did is to adjust A by multiplying it with (Dis / S) / 0.1. That means we scale it similarly to how we scale S. After all, A only has a tiny signal after this adjustment, and one can easily drop it. As we rely on relative features in the model, we don't apply any other standardization.
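Written out, the two adjustments look roughly like this. The fallback for `S == 0` (leaving `A` untouched) is an assumption added so the sketch does not divide by zero, and `adjust_speed_accel` is a hypothetical name.

```python
def adjust_speed_accel(row):
    """Return a copy of the row with S and A rescaled as described above."""
    s, a, dis = row["S"], row["A"], row["Dis"]
    out = dict(row)
    out["S"] = dis * 10.0
    # Scale A by the same ratio that maps S onto Dis * 10.
    # Assumption: rows with S == 0 keep their original A.
    out["A"] = a * (dis / s) / 0.1 if s > 0 else a
    return out


# Invented row for illustration.
print(adjust_speed_accel({"S": 4.0, "A": 2.0, "Dis": 0.5}))
```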

What worked really well for us is to add augmentation and TTA for Y coordinates. We assume that in a mirrored world the runs would have had the same outcomes. For training, we apply 50% augmentation to flip the Y coordinates (and all respective relative features emerging from it). We do the same thing for TTA where we have a 50-50 blend of flipped and non-flipped inference.
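A minimal sketch of the flip, assuming the features are stored as a flat list of (x, y) pairs relative to the rusher, so mirroring the field just negates every y component. The `model` argument is any callable returning a list of predictions; all names here are hypothetical.

```python
import random


def flip_y(features):
    """Negate the y component of every (x, y) pair in a flat feature list."""
    return [-v if i % 2 == 1 else v for i, v in enumerate(features)]


def augment(features, rng, p=0.5):
    """Training-time augmentation: mirror the play with probability p."""
    return flip_y(features) if rng.random() < p else list(features)


def predict_tta(model, features):
    """Test-time augmentation: 50-50 blend of original and mirrored input."""
    original = model(features)
    mirrored = model(flip_y(features))
    return [(a + b) / 2.0 for a, b in zip(original, mirrored)]


# Toy "model" that is deliberately not symmetric in y.
toy_model = lambda f: [f[1], f[0]]
print(predict_tta(toy_model, [1.0, 2.0]))
print(augment([1.0, 2.0, 3.0, 4.0], random.Random(0)))
```

With absolute coordinates instead of relative ones, the flip would need the field width as well, so working in relative features keeps this step trivial.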

We decided quite early that it is best to do all the fitting within the kernel, specifically as we also have 2019 data available in the reruns. So we also decided early to spend time on optimizing our runtime, because we also knew that when fitting NNs it is important to bag multiple runs with different seeds as that usually improves accuracy significantly and it removes some of the luck factor.

As mentioned above, we use Pytorch for fitting. Kaggle kernels have 2 CPUs with 4 cores, where 2 of those cores are real cores and the other 2 are virtual cores for hyperthreading. While a single run is using all 4 cores, it is not optimal in terms of runtime, because you cannot multiprocess each operation in a fit. So what we did is to disable all multithreading and multiprocessing of Python (MKL, Pytorch, etc.) and did manual multiprocessing on a bag level. That means we can fit 4 models at the same time, gaining much more runtime compared to fitting a single model on all 4 cores.
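The general pattern (one single-threaded fit per process, several processes at once) could be sketched like this. `fit_one` is a placeholder that returns a fake score rather than training anything, and the environment variables shown are the usual OpenMP/MKL/OpenBLAS thread controls; which ones matter depends on the installed libraries.

```python
import os

# Set before importing numerical libraries so each worker stays on one thread.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import random
from multiprocessing import Pool


def fit_one(seed):
    """Placeholder for one single-threaded model fit.

    A real version would build and train a model here (and, with PyTorch,
    could also call torch.set_num_threads(1)). This stub returns a fake score.
    """
    rng = random.Random(seed)
    return seed, rng.random()


def fit_bag(seeds, workers=4):
    """Fit one model per seed, running `workers` fits in parallel."""
    with Pool(processes=workers) as pool:
        return pool.map(fit_one, list(seeds))


if __name__ == "__main__":
    for seed, score in fit_bag(range(8)):
        print(seed, round(score, 4))
```

The trade-off is memory: each worker holds its own copy of the training data, so this only pays off when the data fits several times over.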

Our final subs fit a conservative number of 8 models each, having a total runtime of our subs at below 8500 seconds.

- Transformers and multihead attention, which seem to approximate the dependencies we explicitly use. We mainly focused on trying out attention to include offense-offense and defence-defence dependencies.
- LSTM instead of CNN.
- Adding dependencies like offense-offense and defence-defence explicitly.
- As soon as all the inputs are vectors, it seems tempting to try complex numbers based NNs. There is even a nice paper and a github repo available with the math and implementation of complex number version of all the layers we are using [3], but in keras. However, all the attempts we've made to limit CNNs to vector operations gave worse results.
- Going deeper and wider.
- CNN adjustments known from CV like Squeeze-and-Excitation layers or residual networks.
- Voronoi features.
- Weighing 2018 data higher than 2017 data.
- Multi-task learning, label smoothing, etc.
- Other optimizers, schedulers, lookahead, etc.

Our first sub is our best model fitted on an 8-fold with picking the best epochs based on CV using 2018 and 2019 data (in the rerun, only 2018 in public LB). This model currently has 0.01205 public LB. Our second sub is using full data for fitting with fixed epochs (no early stopping). It currently has public LB 0.01201.

In private reruns we incorporate 2019 data into training and we hope that all goes well, but you never know.

P.S. Don't forget to give your upvotes to @philippsinger as well - this model is a great example of teamwork.

[1] https://www.kaggle.com/robikscube/nfl-big-data-bowl-plotting-player-position

[2] https://www.kaggle.com/c/nfl-big-data-bowl-2020/discussion/119314

[3] https://arxiv.org/abs/1705.09792

**In this topic, I explain the Rules of NFL American Football. A beginner's explanation of American Football Rules.**

Read this topic guide on how to play NFL Gridiron Football by National Football League, NCAA and International rules. Learn about touchdowns, interceptions, fumbles, penalties, offense, defense, downs, fouls and more!

**Let's start :**
The object of the game is for your team to score more points than the opposing team. Teams are made up of 46 players in the NFL, with 11 players taking the field at any one time.

> The field is 100 yards long by 53 1/3 yards wide, with a 10-yard end zone at each end.

White markings on the field help players, referees and spectators keep track of what's going on.

The game starts with a kickoff. The team with possession of the ball is known as the offense, and the team without the ball is the defense.
The job of the offense is to move the ball up the field and score points.
This can be done by either **running forwards with the ball**, or by **throwing it up the field for a teammate to catch**.

The offense is given 4 chances (or 4 downs) to make at least 10 yards. If the offense manages to move the ball 10 yards or more, they will retain possession of the ball whilst given another 4 downs to make another 10 yards.

On your TV screen, you will see this graphic. This tells you what down the team is on and this tells you how many yards they need to make. If you're also watching this on TV, they will also show the lines they need to cross in order to make their downs.

The defense's job is to stop the offense moving the ball forwards by tackling. This includes pulling them to the ground, stopping them from moving forward or forcing them off the field. If the offense fails to move the ball 10 yards within 4 downs, the ball is given to the defending team at that point. The defending team will then bring on their offensive players and try and move the ball in the opposite direction so that they can score. You will most likely see an offense kick the ball away on fourth down to make it more difficult for the other team to score.
The teams will usually have three different units of 11 players that come on the field at different times.
They include:
1. **The Offense:** These players will usually come on the field when they have possession of the ball. The offensive unit consists of these positions:
   - **The quarterback:** is the most important player on the field as he's the one who decides to pass the ball up the field, hand it off to a teammate so that they can run with it, or run with it himself.
   - **The offensive line:** these positions are usually responsible for protecting the quarterback.
   - **The wide receivers:** are responsible for running down the field to catch the ball thrown by the quarterback.
   - **The running back and full back:** are responsible for running with the ball up the field.
2. **The Defense:** These players will usually come on the field when the other team has the ball. The defensive unit consists of these positions:
   - **The defensive line:** is responsible for moving past the offensive line.
   - **The line-backers:** stop running backs coming through the defensive line and they also are responsible for attacking the quarterback.
   - **The cornerbacks:** try and stop the wide receivers.
   - **The safeties:** try and stop a pass up the middle of the field.
3. **Special Teams:** Special teams are specialist players that come on the field when there is a kick involved. Within the special teams is a mix of offensive and defensive players mixed with either a punter or kicker for offense, or a punt returner for defense.

In American Football, there are four different ways of scoring:
1. **Touchdown:** The main way of scoring is via a touchdown. If the ball is carried into the endzone area, or thrown and caught in the endzone, this is a touchdown and is worth **6 points**. Unlike in Rugby, you do not need to touch the ball down on the ground, all you have to do is cross the line with the nose of the ball to score.
2. **Extra points:** Once a touchdown has been scored, you have the option of kicking it through the uprights for an extra point, or try and pass or run the ball into the endzone again for an extra two points. Most teams play it safe and go with the one point.
3. **Field Goal:** At any time, the team with the ball can kick the ball between the posts and over the crossbar. To do this, they must hand it to a teammate who will hold it on the ground ready for a kicker to make the kick. A successful kick scores 3 points.
4. **Safety:** If the defense tackles an offensive player behind his own goal line, the defending team scores two points.

The game is played in 4 x 15 minute quarters, for a combined playing time of 60 minutes. Highest score at the end of 60 minutes wins. Ties are rare in American Football, and overtime periods are played if necessary to determine a winner. Different leagues have different rules about tie games.

**Is that it? Is that all I need to know.**

Well, you're almost there, but American Football is filled with lots of rules, and you'll need to understand a few more of them before you watch or play a game. For example:
- **FUMBLE:** If a ball carrier or passer drops the ball, that's a fumble. Any player on the field can recover the ball by diving on it or he can run with it. The team that recovers a fumble gets possession of the ball.
- **INTERCEPTION:** An aggressive defense can regain possession of the ball by catching (intercepting) passes that are meant for players on the other team. Both fumble recoveries and interceptions can be run back into the end zone for touchdowns.
- **SACK:** If the defense tackles a Quarterback whilst he has possession of the ball, this is known as a "sack". This is detrimental to the offense, as a down is wasted and it usually results in a loss of yards.
- **INCOMPLETE PASS:** If a pass intended for a receiver hits the ground first, it is ruled an incomplete pass. A down is wasted and play restarts from the spot of the last down.
- **PENALTY:** If a player breaks one of the rules, referees will throw flags onto the field. They will determine who made the foul and how many yards his team should be penalised.
- **CHALLENGE:** If a coach disagrees with a decision on the field, they can throw red flags onto the field. The previous play will then be reviewed and if the challenge is successful, the ruling on the field is reversed. If the challenge is unsuccessful and the ruling on the field stands, they forfeit one timeout.
- **TIMEOUTS:** If a team wants to stop the clock to regroup, take a break or discuss strategy, they are allowed three time-outs per half. Each time out lasts 60 seconds. Players get a break of 12 minutes at half time.

This is all a lot to take in, but once you start playing or watching American Football, the rules will become clear.

In this competition, your task is to predict the result of a play when a ball carrier takes the handoff. As an "armchair quarterback" watching the game, you may think you can predict the result - but what does the data say?

In this competition, you will develop a model to predict how many yards a team will gain on given rushing plays as they happen. You'll be provided game, play, and player-level data, including the position and speed of players as provided in the NFL's Next Gen Stats data. And the best part - you can see how your model performs from your living room, as the leaderboard will be updated week after week on the current season's game data as it plays out.

**Some useful videos you must watch:**
The Rules of American Football - EXPLAINED! (NFL)
A Beginner's Guide to American Football | NFL
THE BEGINNERS GUIDE TO AMERICAN FOOTBALL
Introduction to Football: Positions
Learn American Football in 5 Minutes

If you have found this explanation at all helpful, please **Upvote**, rate, and leave a comment if any details are missing.

Here's what they mean:

Some interesting observations:

- Most are engineered
- Most are related to the play's PHYSICS: distances, times, accelerations, positions
- All are CONTINUOUS variables

I'd love to get your inputs/comments about your best features so far, and in which direction you are engineering features. So far, imho, engineering features about the play's physics looks like the way to go.

Cheers!

PS: Btw, all features are either standardized (no clear min/max) or normalized (clear min/max) before training.

PS2: There's an (important) typo on the table: Orientation full offensive is 180, full retreat is 0.

PS3: Runner's direction is normalized exactly as Orientation.

- 5Fold: OOF CV 0.01244 LB: 0.01341

My best score is currently a single LGBM: CV 0.0122, LB 0.01359. Can you share your best single model type and LB score?

As we all know, data in 2017 is different from 2018, so data cleaning is very important in this competition.

- Orientation: 90 degree rotation in 2017
- A: I could not find a good way to standardize A, so I replaced A in 2017 by the mean. Surprisingly, this improved my LB by 0.0002.
- S: If we look at the 2018 data, we can see that S is linearly related to Dis, while the 2017 data does not fit as well. Fitting a linear regression on 2018 data gives a coefficient of 9.92279507, which is very close to 10, so finally I replaced S by 10 * Dis for both 2017 and 2018 data. This also gave me a 0.0002 improvement.
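A small sketch of how one might check the S vs Dis relationship and apply the cleaning. Two assumptions are baked in and may not match the author's code: the regression is fitted through the origin, and the 2017 orientation fix is a +90 degree shift (the post does not say which way the rotation goes). The names `slope_through_origin` and `clean_row` are hypothetical.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope for y = k * x (no intercept, an assumption here)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def clean_row(row):
    """Replace S by 10 * Dis and rotate 2017 orientation by 90 degrees."""
    out = dict(row)
    out["S"] = 10.0 * row["Dis"]
    if row["Season"] == 2017:
        # Assumed sign of the rotation; flip to -90 if the data disagrees.
        out["Orientation"] = (row["Orientation"] + 90.0) % 360.0
    return out


# Invented 2018-style values where S is exactly 10 * Dis.
dis = [0.1, 0.2, 0.4]
speed = [1.0, 2.0, 4.0]
print(round(slope_through_origin(dis, speed), 6))
print(clean_row({"Season": 2017, "Orientation": 300.0, "Dis": 0.3, "S": 2.9}))
```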

total 36 features, ['IsRusher','IsRusherTeam','X','Y','Dir_X','Dir_Y', 'Orientation_X','Orientation_Y','S','DistanceToBall', 'BallDistanceX','BallDistanceY','BallAngleX','BallAngleY', 'related_horizontal_v','related_vertical_v', 'related_horizontal_A','related_vertical_A', 'TeamDistance','EnermyTeamDistance', 'TeamXstd','EnermyXstd', 'EnermyYstd','TeamYstd', 'DistanceToBallRank','DistanceToBallRank_AttTeam','DistanceToBallRank_DefTeam', 'YardLine','NextX','NextY', 'NextDistanceToBall', 'BallNextAngleX','BallNextAngleY', 'BallNextDistanceX','BallNextDistanceY','A']

Always include 2017 data for training, 3 group folds by week for 2018 data, use only 2018 data for evaluation. In this way the CV score is close to public LB.

Transformer (2-layer encoder + 2-layer decoder); a large number of attention heads is the key.

- Optimizer: RAdam + lookahead
- Number of epochs: 30
- Batch size: 32
- Weight decay: 0.1
- Ensemble: snapshot ensemble (pick models at epoch 11, 13,...,17,29)
- Learning rate scheduler: 8e-4 for epoch 0-10,12,14,...,28. 4e-4 for epoch 11,13,...,29

Since we are only given 4 hours of CPU training, a snapshot ensemble seems to be a perfect choice as it won't increase our training time and is significantly better than a single model. In my final submission, I repeat the training (using all data) for 11000s and 9000s (safe mode).
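The alternating learning rate and the snapshot idea can be sketched as below. Averaging the snapshot predictions is an assumption about how the ensemble is combined, and the set of snapshot epochs is left to the caller because the post only gives it in abbreviated form. Function names are hypothetical.

```python
def learning_rate(epoch, high=8e-4, low=4e-4, first_low_epoch=11):
    """8e-4 for epochs 0-10 and later even epochs; 4e-4 for odd epochs from 11."""
    if epoch >= first_low_epoch and epoch % 2 == 1:
        return low
    return high


def average_snapshots(snapshot_preds):
    """Average per-class predictions from several saved snapshots."""
    n = len(snapshot_preds)
    return [sum(column) / n for column in zip(*snapshot_preds)]


schedule = [learning_rate(e) for e in range(30)]
print(schedule[10], schedule[11], schedule[12], schedule[29])

# Invented predictions from two snapshots of the same training run.
print(average_snapshots([[0.2, 0.8], [0.4, 0.6]]))
```

Snapshots taken at the low-rate epochs come from a single training run, which is why the ensemble costs no extra training time.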

Anyway, here is my version. Please indulge me, this is the first time I have created a complex (well, complex for me) NN. I did what I could in 10 days. I'm sure it can be improved in many ways.

**The transformer**

I decided to implement it from scratch using Keras because I wanted to learn the transformer architecture instead of using someone else's implementation. For those not familiar with the transformer I recommend these two tutorials, they helped me a lot:

http://jalammar.github.io/illustrated-transformer/

https://nlp.seas.harvard.edu/2018/04/03/attention.html

My thinking was heavily influenced by the top teams' models in the molecular competition, especially the #6 solution for its simplicity:

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106407

This model had a small number of layers, which was very interesting given the limit on CPU time here. I also looked at its author's git repo to resolve doubts such as: is layer norm performed before or after dropout?

https://github.com/robinniesert/kaggle-champs

**Data cleaning**

I normalized data like in my public notebook. I wish I had used 10*Dis rather than S, and also replaced A in 2017 by a constant. Adding these after deadline improved my CV by almost 0.00015. I wish I could see the effect on LB.

In addition to flipping along the X axis as in my notebook, I also flipped along the Y axis when needed, so that the rusher always moves towards the top right when displaying plays.
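A minimal sketch of the Y flip, assuming the X flip has already been applied and that each play is held as per-player arrays of Y position and Y velocity. The function name and interface are illustrative. Angle columns such as Dir and Orientation would need the matching mirror transform, which is not shown here.

```python
import numpy as np

FIELD_WIDTH = 160.0 / 3.0  # NFL field width in yards (53 1/3)

def flip_y_if_needed(y, vy, rusher_idx):
    """Mirror a play across the field's long axis when the rusher moves down.

    y, vy: per-player Y position and Y velocity component.
    rusher_idx: row of the ball carrier.
    """
    y = np.asarray(y, dtype=float)
    vy = np.asarray(vy, dtype=float)
    if vy[rusher_idx] < 0:
        # Positions are mirrored and Y velocities change sign.
        return FIELD_WIDTH - y, -vy
    return y, vy
```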

**Features**

My model uses almost no feature engineering. All in all, it uses:

- Player features: `'X', 'Y', 'X_dir', 'Y_dir', 'X_S', 'Y_S', 'S', 'A', 'IsRusher', 'IsOnOffense'`. X and Y are relative to the rusher position.
- Distance matrix: square of the inverse distance matrix.
- Play features: rusher position and yardline.
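For reference, a squared inverse distance matrix like the one listed above could be built as follows. This is a sketch, not the notebook's code: the small `eps` and the zeroed diagonal are assumptions added to avoid dividing by zero.

```python
import numpy as np

def squared_inverse_distance(xy, eps=1e-6):
    """Pairwise 1 / d^2 for player positions xy of shape (n_players, 2)."""
    xy = np.asarray(xy, dtype=float)
    diff = xy[:, None, :] - xy[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    inv = 1.0 / (d2 + eps)       # eps guards against overlapping players
    np.fill_diagonal(inv, 0.0)   # assumption: a player gets no weight on itself
    return inv
```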

That's it.

**Architecture**

Similar to the Molecular solution I started from, I start by embedding the player features into a latent vector via a dense layer. I used embeddings of length 64.

Then I use a distance attention block. To update a given player embedding, I use a weighted sum of the other players' embeddings. The weight depends on the distance. I tried various ways, and a normalized squared inverse was best. I was about to try other transforms when I decided to have them learnt by the model, via a 1x1 convolution block on the data.
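The weighted sum at the heart of the distance attention can be sketched in NumPy as below. The function name is hypothetical, and "normalized" is interpreted here as dividing each row of weights by its sum, which is an assumption. The learnt 1x1 convolution on the distances is not included.

```python
import numpy as np

def distance_attention(embeddings, weights):
    """Distance-weighted sum of the other players' embeddings.

    embeddings: (n_players, d) array.
    weights: (n_players, n_players) non-negative array, for example squared
             inverse distances with a zero diagonal.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Assumption: normalize each row so the weights over other players sum to 1.
    row_sums = weights.sum(axis=1, keepdims=True)
    normalized = weights / np.maximum(row_sums, 1e-12)
    return normalized @ embeddings
```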

All my convolution blocks have a skip connection and 2 convolution layers with ReLU activation. As in the transformer, I used Glorot uniform weight initialization everywhere I could think of.

The distance attention is added to the skip connection, then normalized with a custom LayerNorm, followed by dropout. I use dropout 0.25 everywhere.

The next block of layers is a vanilla transformer multi-head attention. Well, that's what I tried to implement, and any difference from the transformer is an unintended mistake. If someone has the courage to read my code and provide feedback, I'd be extremely grateful! I used 4 attention heads, and a length of 16 for queries, keys, and values.
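For readers new to the mechanism, here is a NumPy sketch of multi-head scaled dot-product self-attention with the sizes mentioned above (4 heads, length 16 per head, 64-long embeddings). It is not the Keras code from the notebook: the weights are random placeholders, and 22 players per play is used as the sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo):
    """Scaled dot-product self-attention over players.

    x:  (n_players, d_model)
    wq, wk, wv: (n_heads, d_model, d_head) projection weights
    wo: (n_heads * d_head, d_model) output projection
    Returns (output, attention); attention is (n_heads, n_players, n_players).
    """
    q = np.einsum("nd,hdk->hnk", x, wq)
    k = np.einsum("nd,hdk->hnk", x, wk)
    v = np.einsum("nd,hdk->hnk", x, wv)
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attention = softmax(scores, axis=-1)
    heads = attention @ v  # (n_heads, n_players, d_head)
    # Concatenate the heads along the feature axis before the output projection.
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return concat @ wo, attention

rng = np.random.default_rng(0)
n_players, d_model, n_heads, d_head = 22, 64, 4, 16
x = rng.normal(size=(n_players, d_model))
wq, wk, wv = (rng.normal(scale=0.1, size=(n_heads, d_model, d_head)) for _ in range(3))
wo = rng.normal(scale=0.1, size=(n_heads * d_head, d_model))
out, attn = multi_head_self_attention(x, wq, wk, wv, wo)
```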

**Isotonic regression**

The output of multi head attention restricted to the rusher embedding is concatenated with the embeddings of play features. Then this is fed into two output layers. The first one is a linear layer with 199 output units followed by a sigmoid activation. The output is the 199 probabilities.

The issue with this is that nothing forces these probabilities to be monotonically increasing. I didn't like the fix used in many public kernels, which was to replace each probability by the max of all probabilities up to it. I didn't like it because if you did the transformation the other way round, starting from the right and taking the min, you did not get the same result.

I tried to output a softmax and then compute the cumsum, but this was slower. I ended up running an isotonic regression to make the output monotonically increasing. Isotonic regression improved CV and LB a bit over using the max from the start.
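A small example makes the asymmetry concrete. The sketch below compares the running max from the left, the running min from the right, and an isotonic fit. The isotonic fit uses a hand-written pool-adjacent-violators routine so the snippet stays self-contained; it is not necessarily the implementation used in the notebook.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means are out of order.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return np.concatenate([np.full(count, total / count) for total, count in blocks])

p = np.array([0.1, 0.4, 0.2, 0.9])                     # not monotone at index 2
forward_max = np.maximum.accumulate(p)                 # 0.1, 0.4, 0.4, 0.9
backward_min = np.minimum.accumulate(p[::-1])[::-1]    # 0.1, 0.2, 0.2, 0.9
isotonic = pava(p)                                     # about 0.1, 0.3, 0.3, 0.9
```

The two one-sided fixes disagree on the violating pair, while the isotonic fit replaces it with its average.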

**Logistic output**

I was still unhappy with the output. We are asked to output a cumulative distribution function, hence we should base it on a distribution. I looked at various distributions, but none was a perfect fit. I ended up doing some EDA. Start with the cumulative histogram of all yards values in train: it looks like a sigmoid skewed to the right:

This made me think of taking the logits of this cdf. It yields:

This plot is very interesting. We see that logits increase linearly up to near 0, then there is a smooth transition, and logits increase again linearly past 0, albeit with a much smaller slope. This can be approximated quite well with two half lines as shown below.

All I needed were the slopes of the two lines, and the x and y of the point where they meet. Said differently, I could recreate the output from 4 numbers. I implemented a custom layer that outputs the values of the two straight lines, followed by a sigmoid activation. I later settled on 3 numbers only: the two slopes, and the x where they meet. This was as good if not better.
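The idea can be sketched outside of Keras as a plain function that turns the slopes and meeting point into a 199-value CDF. Everything below is illustrative: the parameter values are made up rather than fitted, and fixing the meeting logit at 0 for the 3-number variant is an assumption, since the post does not say what replaces the fourth number.

```python
import numpy as np

def two_line_cdf(yards, slope_left, slope_right, x_meet, y_meet=0.0):
    """CDF whose logits follow two half lines meeting at (x_meet, y_meet)."""
    yards = np.asarray(yards, dtype=float)
    shifted = yards - x_meet
    # Left of the meeting point use the steep slope, right of it the flat one.
    logits = y_meet + np.where(shifted < 0, slope_left, slope_right) * shifted
    return 1.0 / (1.0 + np.exp(-logits))

yards = np.arange(-99, 100)  # the 199 yardage values
cdf = two_line_cdf(yards, slope_left=0.8, slope_right=0.15, x_meet=3.0)
```

With two positive slopes the output is monotonically increasing by construction, so no isotonic post-processing is needed for this head.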

This new output was better than the simple one, but using both outputs was better. Probably because optimizing two different outputs adds some regularization.

**Training**

I used 12-fold time-based (unshuffled) CV with validation folds drawn from 2018 only. I also down-weighted 2017 samples by 0.5. This made CV much more in line with LB. Down-weighting 2017 may have helped a bit given that I had not standardized S and A correctly. I used my local machine with two 1080 Ti GPUs for developing the model. I had been burned too many times by Kaggle kernels being reset for no reason. For the final submission, I uploaded my notebook and only used the last 2 folds.
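A sketch of this setup, assuming one row per play and rows already sorted by time. Both function names are hypothetical. The weights would typically be passed to the training routine as per-sample weights.

```python
import numpy as np

def time_folds_2018(season, n_folds=12):
    """Yield (train_idx, val_idx); rows are assumed sorted by time.

    Validation folds are contiguous, unshuffled slices of the 2018 rows;
    everything else, including all of 2017, is used for training.
    """
    season = np.asarray(season)
    idx = np.arange(len(season))
    for val_idx in np.array_split(idx[season == 2018], n_folds):
        yield np.setdiff1d(idx, val_idx), val_idx

def sample_weights(season, weight_2017=0.5):
    """Down-weight 2017 samples relative to the other seasons."""
    return np.where(np.asarray(season) == 2017, weight_2017, 1.0)
```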

I used Adam optimizer with a learning rate decay on plateaus, and early stopping. From what I read, a predefined linear decay was best, but I hadn't time to tune it.

**Data augmentation**

On the last day of the competition, my teammate Reza made me think of using predicted future positions of players. For a given play I created 2 copies, after 0.3 and 0.6 seconds, assuming straight trajectories and constant acceleration. I'm still wondering if acceleration is always in the same direction as speed. Indeed, it could be that some players decelerate... Of course, play copies were put in the same fold to avoid overfitting.
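The augmentation amounts to constant-acceleration kinematics along each player's current direction. A minimal sketch, with a hypothetical function name and made-up example values, and with the assumption that acceleration acts along the direction of motion:

```python
import numpy as np

def future_positions(x, y, dir_x, dir_y, s, a, dt):
    """Project players dt seconds ahead along their current direction.

    (dir_x, dir_y) is the unit vector of motion, s the speed and a the
    acceleration magnitude. Assumption: acceleration acts along the direction
    of motion, which may not hold for decelerating players.
    """
    travelled = s * dt + 0.5 * a * dt ** 2
    return x + dir_x * travelled, y + dir_y * travelled

x, y = np.array([0.0, 10.0]), np.array([0.0, 5.0])
dir_x, dir_y = np.array([1.0, 0.0]), np.array([0.0, -1.0])
s, a = np.array([2.0, 4.0]), np.array([1.0, 0.0])
copies = [future_positions(x, y, dir_x, dir_y, s, a, dt) for dt in (0.3, 0.6)]
```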

For final prediction I also created 2 copies of each test play, then averaged the predictions of the 3 plays. This yields almost 0.00010 improvement on CV.

Data augmentation led to a LB of 0.01299 less than one hour before deadline...

**Lessons**

First, I should have followed my hunch much earlier. I guess I was a bit intimidated by the task. Second, I wish I had cleaned the data more, especially S and A, as shared by many top teams. Third, I should not use Kaggle kernels for model development: they are too unreliable when running times exceed one hour. They get reset even if they are attended and used interactively.

One thing I don't regret is having teamed up with Reza. He helped me understand NFL football. Also, his implementation of influence and pitch control was very enlightening. I want to use a similar idea (Gaussian mixture) in a layer to preprocess distances before distance attention. Last, but not least, his models are better than mine :D

All in all, even if we probably will miss gold I am quite happy because I learned a lot. And now I can follow writeups of people who also used the transformer architecture!

The code can be seen at https://www.kaggle.com/cpmpml/keras-80?scriptVersionId=24171638

Edit: a much better model can be seen in the latest version of the notebook: https://www.kaggle.com/cpmpml/graph-transfomer?scriptVersionId=24417998

I have improved my code in several ways, including:

- implemented the encoder/decoder attention of the original transformer architecture
- added squeeze and excitation to the convolutions
- S and A cleaning

My CV improved by about 0.00025, not enough to make it a top model, but still interesting. Data cleaning brings about 0.00010.

I think the main interest is the transformer implementation done with the Keras functional API. It yields much more compact code than what we can find online. Here is the updated NN architecture:

The decoder part is simpler than the transformer one, as I included neither a convolution block nor a self-attention block, because the decoder input is so simple (only 3 features). The code can be seen at https://www.kaggle.com/cpmpml/graph-transfomer?scriptVersionId=24417998

Our two final submissions are both around 0.01203. It seems good, but is it really?

Does anyone have any good advice on this? What can potentially go wrong with the test data in stage 2, and how can we tackle these issues (apart from the very obvious ones, like getting a NaN value in any column)?

If you standardize S by the mean and std of each season, the distribution looks something like this:

In my case, I got a 0.00001 boost on both CV and LB after standardizing S in this way.

Here are the mean and std in case you are interested:

- 2017: S mean 2.4355, S std 1.2930
- 2018: S mean 2.7570, S std 1.4551
- 2019: S mean 2.7456, S std 1.4501
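Using the values above, per-season standardization of S is a one-liner. The dictionary and function names are illustrative; the linked notebook may organize this differently.

```python
# (mean, std) of S per season, copied from the figures above.
SEASON_S_STATS = {
    2017: (2.4355, 1.2930),
    2018: (2.7570, 1.4551),
    2019: (2.7456, 1.4501),
}

def standardize_s(s, season):
    """Standardize a speed value with its own season's mean and std."""
    mean, std = SEASON_S_STATS[season]
    return (s - mean) / std
```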

I also created a notebook on this topic: https://www.kaggle.com/tnmasui/standardizing-s-by-seasons?scriptVersionId=22142886

It looks like the angle increases clockwise, not counterclockwise as indicated. I am therefore using:

```
import math
radian_angle = (90 - angle) * math.pi/180.0
```
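To make the convention explicit, the conversion can be wrapped in a helper that returns the direction as a unit vector. The helper name is made up for this sketch. It assumes, as the formula above implies, that 0 degrees points along the positive Y axis and that angles grow clockwise.

```python
import math

def direction_vector(angle_degrees):
    """Unit (x, y) vector for an angle that grows clockwise from the Y axis."""
    radian_angle = (90 - angle_degrees) * math.pi / 180.0
    return math.cos(radian_angle), math.sin(radian_angle)
```

Under this convention an angle of 0 points up the Y axis and an angle of 90 points along the positive X axis.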

Edit: I'm using this in this notebook inspired by @statsbymichaellopez R notebook: https://www.kaggle.com/cpmpml/initial-wrangling-voronoi-areas-in-python
