Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features
Imagine a dataset consisting of some number of observations, each observation having N features. If you convert all the features into a numerical representation, you can say that each observation is a point in an N-dimensional space.
When N is low, the relationships between the points are exactly what you would intuitively expect. But sometimes N grows very large; this can happen, for example, if you create many features via one-hot encoding, etc. For very large N, the distances between the points turn out to be larger than you would intuitively expect.
The phenomenon is real. As the number of dimensions N increases, all else remaining the same, the N-volume containing your observations grows in a certain sense (or at least the number of degrees of freedom becomes larger), and so do the Euclidean distances between observations. The cloud of points actually becomes sparser. This is the geometric basis of the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be affected by these changes.
Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting during training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient: you need more trials to cover the same intervals along any axis. A subtle effect hits any model that relies on distance or proximity: as the number of dimensions grows to very large values, if you consider any one point among your observations, all the other points appear far away and somehow nearly equidistant. Since these models rely on differences in distance to do their job, the leveling-out of those differences makes their job much harder. E.g. clustering does not work well if all points appear nearly equidistant.
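To see the "nearly equidistant" effect numerically, here is a minimal NumPy sketch (the sample size and the list of dimensions are arbitrary choices for illustration, not part of this article's experiment). The relative spread of pairwise distances, std divided by mean, shrinks as the dimension grows:

import numpy as np

rng = np.random.default_rng(0)
n_points = 500  # arbitrary sample size for the demonstration

for d in (2, 10, 100, 1000):
    x = rng.random((n_points, d))        # uniform points in the unit d-cube
    sq = (x ** 2).sum(axis=1)
    # squared pairwise distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    iu = np.triu_indices(n_points, k=1)  # unique pairs only
    dist = np.sqrt(np.clip(d2[iu], 0.0, None))
    print(f"d={d:4d}  mean distance={dist.mean():.3f}  std/mean={dist.std() / dist.mean():.3f}")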
For all these reasons, and more, techniques such as PCA, LDA, etc. have been developed: to try to escape the peculiar geometry of spaces with many dimensions, and to distill the dataset down to a number of dimensions compatible with the actual information it contains.
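As a simple illustration of that distillation step, here is a minimal PCA sketch (it assumes scikit-learn is available; the synthetic dataset and the number of components are arbitrary choices, not from this article):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))                    # 5 hidden factors
W = rng.normal(size=(5, 200))
X = latent @ W + 0.1 * rng.normal(size=(1000, 200))    # 200 correlated features

pca = PCA(n_components=5)                              # keep only the strongest directions
X_reduced = pca.fit_transform(X)                       # shape: (1000, 5)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())

The 200 correlated features are compressed down to 5 components while retaining most of the variance.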
It’s hard to intuit the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely hard to visualize, so let’s do some simple 2D visualizations to help our intuition. There is a geometric basis for why dimensionality can become a problem, and this is what we will visualize here. If you haven’t seen it before, the results may be surprising: the geometry of high-dimensional spaces is far more complex than typical intuition suggests.
Consider a square of size 1, centered at the origin, and inscribe a circle in it.
This is the setup in 2 dimensions. Now consider the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to set it up. For simplicity, we will refer to these objects as the “sphere” and the “cube”, no matter how many dimensions they have.
The volume of the cube is fixed: it is always 1. The question is: as the number of dimensions N changes, what happens to the volume of the sphere?
Let’s answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points that are distributed uniformly but randomly in the cube. For each point, we calculate its distance to the origin – if this distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.
If we divide the number of points inside the sphere by the total number of points, this will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation improves when the total number of points is large.
In other words, the ratio inside_points / total_points will approximate the volume of the sphere.
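For reference, the exact answer is known in closed form: the volume of the N-dimensional ball of radius r = 1/2 is

V_N = \frac{\pi^{N/2}}{\Gamma\left(\frac{N}{2} + 1\right)} \left(\frac{1}{2}\right)^{N}

which drops toward zero very quickly as N grows, so the Monte Carlo estimates can later be checked against it.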
The code is quite simple. Since we need many points, explicit loops should be avoided. We could use NumPy, but it is CPU-only and single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch here, but the NumPy code would be almost identical.
If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide that count by the total number of points. The ratio array will end up containing the volume of the sphere for each number of dimensions.
The tunable settings below are chosen for a GPU with 24 GB of memory; adjust them if your hardware is different.
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

# ratio[d - 1] will hold the estimated sphere volume in d dimensions
ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )
    # clean up memory
    torch.cuda.empty_cache()
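As mentioned earlier, the NumPy version is nearly identical. Here is a minimal CPU-only sketch (not from the original code; it uses fewer points and a separate ratio_np array, since it runs single-threaded in regular RAM):

import numpy as np
from tqdm import tqdm

d_max = 22
n = 10**7                  # fewer points than the GPU version; reduce further if memory is tight
ratio_np = np.zeros(d_max)
rng = np.random.default_rng(0)

for d in tqdm(range(d_max, 0, -1)):
    points = rng.random((n, d)) - 0.5            # uniform in [-0.5, 0.5)^d
    dist = np.sqrt((points ** 2).sum(axis=1))    # distance to the origin
    ratio_np[d - 1] = (dist <= 0.5).sum() / n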
Let’s visualize the results:
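One way to plot them is sketched below (it assumes matplotlib and scipy are installed, reuses the ratio array and d_max from the code above, and overlays the exact N-ball volume for comparison):

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma

dims = np.arange(1, d_max + 1)
# exact volume of the N-ball of radius 0.5, for comparison
exact = np.pi ** (dims / 2) / gamma(dims / 2 + 1) * 0.5 ** dims

plt.plot(dims, ratio, "o", label="Monte Carlo estimate")
plt.plot(dims, exact, "-", label="exact volume")
plt.xlabel("number of dimensions N")
plt.ylabel("volume of the inscribed sphere")
plt.yscale("log")          # zero-valued estimates simply do not appear on a log axis
plt.legend()
plt.show()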