Transformation of data

This problem shows that, even when the data does not appear to admit a simple model, a small transformation can reveal one.

We import the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

Then we store the URL where the data is located,

In [2]:
url='https://raw.githubusercontent.com/jmsevillam/Herramientas-Computacionales-UniAndes/master/Data/Regression/spiral.dat'

We load the data to our program

In [3]:
x,y,r,theta=np.genfromtxt(url).T

As an exploratory step, we plot the data,

In [4]:
plt.plot(x,y,'.')
Out[4]:
[<matplotlib.lines.Line2D at 0x21497f985f8>]

Finding a model for this cloud of points does not look easy, but after a small transformation it might be.

The data file already provides the values of $r$ and $\theta$, but in principle they can be computed from $x$ and $y$ by the usual polar transformation.
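
If the file had not provided them, a minimal sketch of that transformation would be $r=\sqrt{x^2+y^2}$ together with the angle from np.arctan2 (r_check and theta_check are illustrative names; note that np.arctan2 returns angles in $(-\pi,\pi]$, so a spiral winding through several turns would additionally need unwrapping, e.g. with np.unwrap):

r_check=np.sqrt(x**2+y**2)    # radial distance
theta_check=np.arctan2(y,x)   # polar angle in (-pi, pi]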

In [5]:
plt.plot(r,theta,'.')
Out[5]:
[<matplotlib.lines.Line2D at 0x21497ffea20>]

We will do the fit using a linear model,

In [6]:
def f(x,a,b):
    return a*x+b
In [7]:
popt,pcov=curve_fit(f,r,theta)
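
As a side note, the diagonal of pcov contains the variances of the fitted parameters, so one-sigma uncertainties can be read off directly (perr is an illustrative name):

perr=np.sqrt(np.diag(pcov))  # one-sigma uncertainties of a and b
print(popt,perr)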

After the fitting, we get

In [8]:
plt.plot(r,theta,'.')
plt.plot(r,f(r,*popt))
Out[8]:
[<matplotlib.lines.Line2D at 0x21498057dd8>]

So, as we want to recover the $x,y$ plane, we proceed to do a resampling of $r$ so that the final curve is smooth.

In [9]:
r_plot=np.linspace(0,10,1000)
x_plot,y_plot=r_plot*np.cos(f(r_plot,*popt)),r_plot*np.sin(f(r_plot,*popt))
In [10]:
plt.plot(x,y,'.')
plt.plot(x_plot,y_plot)
Out[10]:
[<matplotlib.lines.Line2D at 0x21498057710>]

Overfitting

One of the most important lessons when learning to work with data is that description is not prediction. We will fit a group of points with a complex model and show that, once a new point is added, it does not generalize as well as a simpler model.

In [11]:
url='https://raw.githubusercontent.com/jmsevillam/Herramientas-Computacionales-UniAndes/master/Data/Regression/data.dat'
In [12]:
x,y=np.genfromtxt(url).T
In [13]:
x1=x[:-1]
y1=y[:-1]

Our two models

In [14]:
def poly9(x,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9):
    return np.poly1d([a0,a1,a2,a3,a4,a5,a6,a7,a8,a9])(x)
def line(x,a,b):
    return a*x+b
In [15]:
print(len(y1))
10

As we have 10 points, we use a 9th degree polynomial: it has exactly 10 coefficients, so it can pass through all the points.
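
Since the coefficient count matches the point count, the same interpolating polynomial could also be obtained directly with np.polyfit; a quick cross-check (we keep curve_fit below for consistency with the rest of the notebook):

coeffs=np.polyfit(x1,y1,9)  # 10 coefficients for 10 points: exact interpolation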

The warning shown below appears because the method finds that the residual error is exactly zero (there is no distance between the curve and the points), so the covariance of the parameters cannot be estimated.

In [16]:
popt1,pcov1=curve_fit(line,x1,y1)
popt2,pcov2=curve_fit(poly9,x1,y1)
C:\Users\jsevilla\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\optimize\minpack.py:794: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
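
We can verify the zero-error explanation numerically: the largest residual of the polynomial fit should be at machine-precision level (a quick sanity check, not part of the original analysis):

res=y1-poly9(x1,*popt2)
print(np.max(np.abs(res)))  # ~0: the curve passes through every point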
In [17]:
x_plot=np.linspace(0,9,1000)
plt.plot(x_plot,line(x_plot,*popt1))
plt.plot(x_plot,poly9(x_plot,*popt2))
plt.plot(x1,y1,'.')
Out[17]:
[<matplotlib.lines.Line2D at 0x2149815c898>]

But when we add a new point, the orange curve (the polynomial) no longer describes the dataset, while the blue one (the line) still does.

In [18]:
x_plot2=np.linspace(0,10,1000)
plt.plot(x_plot2,line(x_plot2,*popt1))
plt.plot(x_plot2,poly9(x_plot2,*popt2))
plt.plot(x,y,'.')
Out[18]:
[<matplotlib.lines.Line2D at 0x214981c28d0>]
In [19]:
popt3 , pcov3 = curve_fit (line ,x,y)
popt4 , pcov4 = curve_fit (poly9 ,x,y)

If we add the new point and repeat the fit, we get new values for the optimal parameters.

Let us see how different they are,

In [20]:
for i in range(len(popt1)):
    print(popt1[i],popt3[i])
2.0070068176643754 2.002404037751126
1.2186629402930595 1.2324712800328053
In [21]:
for i in range(len(popt2)):
    print(popt2[i],popt4[i])
0.0007582773494645842 6.993128355372365e-05
-0.031062001599914787 -0.0033470298885839183
0.5354718721067647 0.06703476040077067
-5.047269294749025 -0.7308349300533554
28.273770546993347 4.724757754896082
-95.64476841994585 -18.451825324898657
188.7946289546919 42.063461311916996
-195.3918116915053 -49.866684171791995
80.88774675459767 24.54508614817591
0.45282345061228935 0.4555279425159536

It is easier if we evaluate the relative difference (0 means both fits are equal; 1 means they differ by 100%).

In [22]:
for i in range(len(popt1)):
    print((popt1[i]-popt3[i])/popt1[i])
0.002293355395078193
-0.01133072918129859
In [23]:
for i in range(len(popt2)):
    print((popt2[i]-popt4[i])/popt4[i])
9.84317791596159
8.28046734983197
6.9879732381442015
5.906168667089495
4.9841735838613825
4.183485467472099
3.4883284224925277
2.9182836183447756
2.2954761807022184
-0.0059370494128788

The linear model remains essentially unchanged after the addition of the new point, but the polynomial coefficients change by almost an order of magnitude.
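
To put a number on this, we can compare how far each model (fitted on the first 10 points only) lands from the newly added point (x_new and y_new are illustrative names):

x_new,y_new=x[-1],y[-1]
print(abs(y_new-line(x_new,*popt1)))   # small: the line generalizes
print(abs(y_new-poly9(x_new,*popt2)))  # large: the polynomial does not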

Fitting a Gaussian

Again, we load the data and plot it

In [24]:
url='https://raw.githubusercontent.com/jmsevillam/Herramientas-Computacionales-UniAndes/master/Data/Spectra/137Cs_10min.dat'
In [25]:
data=np.genfromtxt(url)
In [26]:
plt.plot(data[:,0],data[:,1])
Out[26]:
[<matplotlib.lines.Line2D at 0x2149823c748>]

We select a fraction of the data.

In [27]:
New_x=data[280:400,0]
New_data=data[280:400,1]
In [28]:
plt.plot(New_x,New_data)
Out[28]:
[<matplotlib.lines.Line2D at 0x214982a8198>]

We define the model: a Gaussian plus a linear background, $f(x)=A\,e^{-(x-\mu)^2/2\sigma^2}+ax+b$.

In [29]:
def Gaussian(x,mu,sigma,amp,a,b):
    return amp*np.exp(-0.5*(x-mu)**2/sigma**2)+a*x+b
In [30]:
popt,pcov=curve_fit(Gaussian,New_x,New_data)
C:\Users\jsevilla\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\optimize\minpack.py:794: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)

The fit did not work: without a sensible starting point, the optimizer cannot find the peak.

In [31]:
plt.plot(New_x,New_data,'.')
plt.plot(New_x,Gaussian(New_x,*popt))
Out[31]:
[<matplotlib.lines.Line2D at 0x214982d8c88>]

But when we provide an initial guess for the parameters (the p0 argument), the fit succeeds.

In [32]:
popt,pcov=curve_fit(Gaussian,New_x,New_data,p0=[350,10,40000,1,1])
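
Rather than guessing, the initial values can also be estimated from the data itself; a sketch with illustrative names (the order matches the Gaussian definition: mu, sigma, amp, a, b):

mu0=New_x[np.argmax(New_data)]             # channel with the highest count
amp0=New_data.max()-New_data.min()         # rough peak height above the background
p0_est=[mu0,10.0,amp0,0.0,New_data.min()]  # sigma eyeballed from the plot
popt_est,pcov_est=curve_fit(Gaussian,New_x,New_data,p0=p0_est)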
In [33]:
plt.plot(New_x,New_data,'.')
plt.plot(New_x,Gaussian(New_x,*popt))
Out[33]:
[<matplotlib.lines.Line2D at 0x21498349080>]
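
Finally, the fitted parameters translate into physically meaningful quantities: for a Gaussian, the full width at half maximum is $\mathrm{FWHM}=2\sqrt{2\ln 2}\,\sigma\approx 2.355\,\sigma$, and the one-sigma parameter uncertainties again come from the diagonal of pcov (a sketch):

perr=np.sqrt(np.diag(pcov))
fwhm=2*np.sqrt(2*np.log(2))*abs(popt[1])  # ~2.355*sigma, in channels
print('peak centre:',popt[0],'+/-',perr[0],'channels')
print('FWHM:',fwhm,'channels')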