Idea
“Currently, the state-of-the-art similarity metrics are only implemented in R. We want to port them to Python and implement them in frameworks like TensorFlow.”
Context of the internship
-The Wasserstein distance has been around for decades (the underlying optimal transport problem goes back to Monge in the 18th century), but it has recently been causing a stir in ML. In essence, it measures how different two distributions are, and the result is a number between 0 and +inf, where 0 means the distributions are identical.
-Now, we can use the Wasserstein distance as a metric to quantify the difference between two probability distributions, but on real-life data we only have samples, so we have to go with an empirical (sample-based) version of it to estimate the actual Wasserstein distance between the two underlying distributions.
-The question that pops up is: how do we decide when two distributions are different based on the Wasserstein distance? How do we go about hypothesis testing? 🤔
-We are not the first ones to think about this. Schefzik et al. have come up with a way to test this and implemented it in R.
So... we want to make this test available in Python and add it to SciPy and TensorFlow Data Validation.
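
To make the idea concrete, here is a minimal sketch in plain Python of what "estimate the distance from samples, then test it" could look like. It is an illustration under stated assumptions, not the procedure of Schefzik et al.: it uses the 1-Wasserstein distance for one-dimensional samples and a plain permutation test to get a p-value. The function names `wasserstein_1d` and `permutation_test` are made up for this example. For the distance itself, `scipy.stats.wasserstein_distance` computes the same one-dimensional quantity.

```python
import random
from bisect import bisect_right


def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(x), sorted(y)
    grid = sorted(xs + ys)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def permutation_test(x, y, n_permutations=1000, seed=0):
    """Return (observed distance, permutation p-value).

    Assumption: a plain permutation test stands in for the actual
    testing procedure; the null hypothesis is that both samples come
    from the same distribution.
    """
    rng = random.Random(seed)
    observed = wasserstein_1d(x, y)
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_permutations):
        # Under the null, the group labels are exchangeable.
        rng.shuffle(pooled)
        if wasserstein_1d(pooled[:n_x], pooled[n_x:]) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly 0.
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    rng = random.Random(42)
    a = [rng.gauss(0.0, 1.0) for _ in range(100)]
    b = [rng.gauss(0.5, 1.0) for _ in range(100)]
    distance, p_value = permutation_test(a, b)
    print(f"distance={distance:.3f}, p-value={p_value:.4f}")
```

A small p-value suggests the two samples come from different distributions. The permutation loop is the expensive part, so a port would probably want to vectorise it or approximate the tail of the null distribution instead of brute-forcing it.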