XNN Regression
If you're involved in AI, you may think this is some sort of neural network related to the RNN or CNN maybe? If you're not involved in AI, you're probably going to leave if I use another acronym without explanation (AI stands for Artificial Intelligence btw - I mean by the way). So what I am calling an XNN is actually a close relative to something called a KNN. KNN stands for K-nearest neighbors and is actually a very simple algorithm to understand.
KNN
If you have ever bought a house, you have probably essentially used KNN to some degree. I haven't bought a house, so I am just going to imagine. I imagine that before making an offer on a house, you would want to know what you think a fair price is. So to come up with an educated guess, you ask the neighbor to the left and the neighbor to the right how much they paid for their houses and then average the two. You probably didn't do exactly that, but if you had, you would have performed K-nearest neighbors where K equals two. Basically, you predicted something by averaging the most similar known examples.
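To make that concrete, here is a tiny Python sketch of the two-neighbor version (the positions, prices, and function name are all made up for illustration):

```python
# A minimal sketch of the house-price thought experiment: predict a price by
# averaging the K nearest known sales. Positions and prices are invented.
def knn_predict(query_position, known_sales, k=2):
    # known_sales: list of (position, price) pairs; "position" stands in for
    # whatever notion of nearness you like (here, distance along the street).
    by_distance = sorted(known_sales, key=lambda sale: abs(sale[0] - query_position))
    nearest = by_distance[:k]
    return sum(price for _, price in nearest) / len(nearest)

# Ask the neighbor on the left and the neighbor on the right, then average.
sales = [(1, 250_000), (3, 310_000), (9, 410_000)]
print(knn_predict(2, sales, k=2))  # -> 280000.0
```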
While taking "nearest neighbor" literally is helpful to illustrate KNN, it may confuse you later. Nearness is a measure of similarity. Distance is how we calculate similarity; however, know that distance expands beyond simply "how many houses away was the neighbor." Nearest neighbors in the real estate example should probably involve the date of sale in addition to geographic proximity. Your geographic neighbors may have bought their houses 30 years ago and would not be as similar to your situation as the house a mile away that sold yesterday. Nearness for this example could involve hundreds of different attributes: square feet; number of floors, bedrooms, bathrooms; has garage, pool; and so on.
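For example, "nearness" over several attributes might be a distance function roughly like this sketch (the attribute names and scale factors are just illustrations, not a recommended recipe):

```python
import math

def nearness(candidate_sale, target_house):
    # How "near" a past sale is to the house being priced. Smaller is nearer.
    # The scale factors are guesses meant to make the units roughly comparable.
    return math.sqrt(
        ((candidate_sale["square_feet"] - target_house["square_feet"]) / 100) ** 2
        + (candidate_sale["bedrooms"] - target_house["bedrooms"]) ** 2
        + candidate_sale["miles_away"] ** 2                 # geographic proximity
        + (candidate_sale["days_since_sale"] / 365) ** 2    # how stale the sale is
    )

my_house = {"square_feet": 1800, "bedrooms": 3}
sold_yesterday_a_mile_away = {"square_feet": 1750, "bedrooms": 3, "miles_away": 1.0, "days_since_sale": 1}
next_door_30_years_ago = {"square_feet": 1800, "bedrooms": 3, "miles_away": 0.0, "days_since_sale": 30 * 365}
# The house a mile away that sold yesterday comes out much "nearer" than the
# next-door sale from 30 years ago.
print(nearness(sold_yesterday_a_mile_away, my_house), nearness(next_door_30_years_ago, my_house))
```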
There are many variations of KNNs. I won't get into them here, but there are even KNNs for classification, where the prediction isn't a continuous number like house price but rather a class - a medical diagnosis, for example, where you have symptoms and you want to know the most likely illness. The type of KNN I am focusing on is for continuous-valued predictions, though. This is called regression. A slightly more sophisticated approach for a regression KNN than the simple average I mentioned is a weighted average based on how near, or similar, each of the K neighbors is.
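A distance-weighted version could look something like this sketch (inverse distance is just one common weighting choice, not necessarily the scheme used later in this post):

```python
def weighted_knn_predict(query, examples, k, distance):
    # examples: list of (features, value) pairs
    # distance(features, query): how dissimilar an example is to the query
    ranked = sorted(examples, key=lambda ex: distance(ex[0], query))[:k]
    # Nearer neighbors get larger weights; the small constant avoids division by zero.
    weights = [1.0 / (distance(f, query) + 1e-9) for f, _ in ranked]
    return sum(w * v for w, (_, v) in zip(weights, ranked)) / sum(weights)

# With the toy street positions from before, the far sale barely moves the answer.
sales = [(1, 250_000), (3, 310_000), (9, 410_000)]
print(weighted_knn_predict(2, sales, k=3, distance=lambda pos, q: abs(pos - q)))
```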
XNN
So I should probably read the research on KNN to see how unoriginal what I am doing is, but ignorance is bliss, so I'm just going to pretend I am a genius. The primary difference between my implementation of the algorithm and anything I have seen elsewhere is that mine doesn't use any predetermined K. Typically, you have a set K for all of your predictions. As you might imagine, choosing a good K isn't easy. In the real estate example, you could choose a K of one, but you would probably experience overfitting / variance, which basically means the algorithm doesn't generalize well. If your nearest neighbor was someone in a hurry who had no time to negotiate, you'd probably conclude that you can get away with paying a lot less. With a larger K, though, you can offset those people with all the people who were in no rush. But if you pick too large a K and don't properly weight each neighbor's impact, you end up with an underfitting / bias problem. In the extreme case, the algorithm just predicts the same price for every house - the average price of all the houses it knows about.
The following are graphs I made to illustrate these problems. In the first pair, the green dots (which are particularly hard to see in the overfitting image) represent the predictions a KNN model would give at those x-values using the purple dots as its known examples. In the second pair of graphs, you can see which points were chosen as neighbors (the black dots), which were excluded (the red dots), and what output the neighbors produced (the green dot). The size of each black dot is supposed to represent its weight on the output, but that could use some tweaking to show more contrast.
My XNN uses a variable number of neighbors. That variable is the X in XNN. How I got to the solution is a bit of a long story. It all started because KNN is slow at making predictions relative to other algorithms: it has to look at all the known examples to determine which K are nearest to the one it's trying to predict. Because of this, I figured it was a good candidate for GPU optimization - basically making the algorithm as parallelized as possible so it can calculate 1000 distances in parallel versus having to do so one after the other. Once I started parallelizing my KNN algorithm, I realized that efficiency was no longer much of an excuse for keeping K small. Take that to the extreme and you make K equal to the number of known examples you have, so every example is a nearest neighbor. In two dimensions this was producing beautifully smooth graphs. However, I was experiencing underfitting, particularly on the edges. This makes sense because even if you give the examples on the opposite edge really small weights, they will still ever so slightly pull at the predicted value. I played around with various weighting schemes and couldn't manage to find one that didn't suffer from this. The middle portion of the graphs always looked perfect, though, so I set out to fix the edge problem without ruining the center.
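Here is a toy illustration of that edge pull, using inverse-distance weights over every known example (again, just one weighting scheme one could try, not necessarily the ones behind the graphs above):

```python
# Toy 1-D data where the true relationship is y = x.
examples = [(float(x), float(x)) for x in range(11)]

def predict_with_all_neighbors(query_x):
    # Every known example is a "neighbor"; nearer ones get larger weights.
    weights = [1.0 / (abs(x - query_x) + 1e-9) for x, _ in examples]
    return sum(w * y for w, (_, y) in zip(weights, examples)) / sum(weights)

print(predict_with_all_neighbors(5.5))  # ~5.4: the middle looks nearly perfect
print(predict_with_all_neighbors(0.5))  # ~1.9 instead of 0.5: the far points pull the edge inward
```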
My first idea was simply to look at how far the point being predicted was from the center of all known points. This would tell me how edge-y a point was. If I know how edge-y a point is, I can reduce K, now X, to be a smaller, more appropriate number. This might have worked pretty well for my purposes if I had fixed the weighting scheme I was using at the time, but it wasn't performing quite as well as I hoped, so I moved on to a better solution. Instead of just making X a function of how far from the center the point is, I could look at how balanced the nearest neighbors were around the point to be predicted. So the new algorithm became something like this (a rough sketch in code follows the list):
1. Pick the nearest neighbor.
2. Calculate the average distance of the points chosen so far from the point to be predicted.
3. Pick the next nearest point that helps balance the points chosen so far around the point to be predicted.
4. If no points make it more balanced, stop picking points.
5. Otherwise, go back to step 2.
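Here is one way that loop could look in Python, kept 1-D for readability. Reading "average distance" as the average signed offset of the chosen points from the query, and "helps balance" as "shrinks that offset," is my interpretation of the steps, not a spec:

```python
def balanced_neighbors(query, points):
    # Step 1: start with the single nearest neighbor.
    remaining = sorted(points, key=lambda p: abs(p - query))
    chosen = [remaining.pop(0)]
    while remaining:
        # Step 2: how imbalanced are the chosen points around the query?
        # (average signed offset of the chosen points == their mean minus the query)
        imbalance = abs(sum(chosen) / len(chosen) - query)
        # Step 3: the next nearest point whose addition shrinks that offset.
        best = next((p for p in remaining
                     if abs((sum(chosen) + p) / (len(chosen) + 1) - query) < imbalance),
                    None)
        if best is None:
            break  # Step 4: no point makes it more balanced, so stop.
        chosen.append(best)  # Step 5: keep it and go back to step 2.
        remaining.remove(best)
    return chosen

# In the middle, it picks a neighbor on each side:
print(balanced_neighbors(5.0, [1.0, 4.0, 6.0, 7.0, 9.0]))  # [4.0, 6.0]
# At an edge there is nothing to balance with, so X stays small:
print(balanced_neighbors(0.5, [1.0, 4.0, 6.0, 7.0, 9.0]))  # [1.0]
```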
This way the edges will not use a large X because, by definition of being edge-y, they have nothing to balance them on one side. Even though this is definitely a more thorough approach, it suffered from overfitting at the edges. So I realized I was going to have to live with some imbalance on the edges and just try to make my weighting scheme compensate for the larger X. I looked into the weights I was getting back and found that they were all relatively similar, which meant I was calculating something too close to a simple average. My current solution is to pick the nearest neighbors with some minimum (currently an arbitrary 10), determine the average distance and standard deviation from the point to be predicted, and finally calculate the z-score and percentile for each point. This means a point really close might have a weight correlated with 99 (for the 99th percentile) while a point relatively far away would have a weight correlated with 1 (for the 1st percentile). In that case, the 99th percentile point would have 99x the impact of the 1st percentile point.
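A rough sketch of that weighting, assuming the percentile comes from the normal CDF of the z-score, and (to keep it short) just taking the nearest `minimum` points rather than the balance-based selection above:

```python
from statistics import NormalDist, mean, stdev

def percentile_weights(distances):
    # Turn each distance into a percentile-style weight: a distance far below
    # the mean (very near) lands around 99, one far above it lands around 1.
    mu = mean(distances)
    sigma = stdev(distances) or 1e-9  # guard against all-equal distances
    weights = []
    for d in distances:
        z = (d - mu) / sigma
        weights.append(100 * (1 - NormalDist().cdf(z)))  # nearer -> higher percentile
    return weights

def xnn_predict(query, examples, distance, minimum=10):
    # examples: (features, value) pairs. For brevity this keeps only the nearest
    # `minimum` points; in the XNN described above, the set would come from the
    # balance-based selection with `minimum` (currently an arbitrary 10) as a floor.
    kept = sorted(examples, key=lambda ex: distance(ex[0], query))[:minimum]
    dists = [distance(f, query) for f, _ in kept]
    weights = percentile_weights(dists)
    return sum(w * v for w, (_, v) in zip(weights, kept)) / sum(weights)
```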
The edges are still slightly overfitting, but it is much improved over many of my previous attempts. I may still mess with the weighting scheme, but I am fairly sold on the idea that picking an arbitrary K is never going to be an all-around good solution compared to letting balance vary the number of neighbors chosen. Lastly, I will probably try to find a non-arbitrary way to choose the minimum number of neighbors. The edges need to be allowed to have relatively few, but they also need more than one example to give a smooth, generalized output.