Friday, June 30, 2017

Simple code to generate synthetic time series data in Python / Pandas

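Here is a short script that generates a synthetic daily time series. It draws one value per day from a Gumbel distribution and sets any negative values to zero.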

import numpy as np
import pandas as pd

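# Location and scale parameters of the Gumbel distribution
med = 15.5
dp = 8.2
# Daily dates from 2001-01-01 up to (but not including) 2016-12-01
sDays = np.arange('2001-01', '2016-12', dtype='datetime64[D]')
nDays = len(sDays)

# Draw one random value per day and set negative values to zero
s1 = np.random.gumbel(loc=med,scale=dp,size=nDays)
s1[s1 < 0] = 0

# Build a DataFrame indexed by date and plot it
dfSint = pd.DataFrame({'Q':s1},index=sDays)
dfSint.plot()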
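
As a possible variation on the script above, the sketch below uses a seeded random generator so the series is reproducible, and then aggregates the daily values into monthly means. The seed value, the use of pd.date_range and the monthly aggregation are assumptions added for illustration; they are not part of the original script.

```python
import numpy as np
import pandas as pd

# Seeded generator so every run gives the same series (the seed is arbitrary)
rng = np.random.default_rng(42)

# Same daily range as the script above: 2001-01-01 through 2016-11-30
days = pd.date_range("2001-01-01", "2016-11-30", freq="D")

# Gumbel-distributed values, with negative values clipped to zero
q = rng.gumbel(loc=15.5, scale=8.2, size=len(days))
q = np.clip(q, 0, None)

series = pd.DataFrame({"Q": q}, index=days)

# Aggregate the daily values into monthly means (labelled by month start)
monthly_mean = series.resample("MS").mean()
print(monthly_mean.head())
```

Clipping at zero changes the distribution slightly, so the sample mean of the result will not match the mean of the unclipped Gumbel distribution exactly.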

Saturday, June 24, 2017

Pandas - How to read text files delimited with fixed widths

With the pandas library it is easy to read fixed-width text files.


In this example, the first 4 lines of the text file contain no data and the 5th line holds the header. The header and the data are laid out in fixed-width columns, with the following widths in characters:
  •  12, 10, 6, 9, 7, 7, 7, 4
The following code reads the file into a pandas DataFrame and asks pandas to parse the dates into datetime format:

import pandas as pd

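# widths gives each column's width in characters; skiprows=4 skips the lines before the header.
# Note: parse_dates=True only tries to parse the index, so pass index_col
# (or a list of column names to parse_dates) for a date column to be converted.
ds2 = pd.read_fwf('yourtextfile.txt', widths=[12,10,6,9,7,7,7,4], skiprows=4, parse_dates=True)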
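
To try this without a file on disk, here is a minimal self-contained sketch. The file content, the three column names (Date, Flow, Code) and the widths 12, 10 and 6 are all hypothetical and much simpler than the file described above. Instead of relying on parse_dates, it converts the date column explicitly with pd.to_datetime.

```python
import io
import pandas as pd

# Hypothetical layout: three fixed-width columns
widths = [12, 10, 6]
rows = [
    ("Date", "Flow", "Code"),        # header (5th line of the "file")
    ("2017-06-01", "12.5", "A"),
    ("2017-06-02", "9.8", "B"),
]

# Four lines without data, followed by the padded header and data lines
lines = ["comment 1", "comment 2", "comment 3", "comment 4"]
for row in rows:
    lines.append("".join(cell.ljust(w) for cell, w in zip(row, widths)))
text = "\n".join(lines) + "\n"

# Read the in-memory text exactly as we would read a file
df = pd.read_fwf(io.StringIO(text), widths=widths, skiprows=4)

# Convert the date column explicitly
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
```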



Wednesday, June 21, 2017

Exponential curve fit in numpy

With the numpy function "polyfit" we can easily fit different kinds of curves, not only polynomials.

According to the documentation, numpy.polyfit does the following:


"
Least squares polynomial fit.

Fit a polynomial p(x) = p[0] * x**deg + ... + p[deg] of degree deg to points (x, y). Returns a vector of coefficients p that minimises the squared error.
"


If X and y are arrays holding our data (with all y values positive, so that the logarithm is defined), the code:

coef = np.polyfit(X, np.log(y), 1)


will fit a straight line to log(y) and return two coefficients, the slope coef[0] and the intercept coef[1], which define the equation:

exp(coef[1])*exp(coef[0]*X)

This gives the exponential curve that best fits our data, X and y, in the least-squares sense on log(y).
The polyfit function also accepts weights, which we can use, for example, to give less importance to very small values. We can pass the weights as follows:

coef = np.polyfit(X, np.log(y), 1, w=np.sqrt(y))


This gives more weight to higher values.
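
To check that this approach recovers known parameters, here is a minimal sketch on synthetic data of the form y = a*exp(b*X). The true values a = 2.0 and b = 0.7, the noise level and the seed are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Synthetic data with small multiplicative noise (hypothetical parameters)
X = np.linspace(0.0, 5.0, 50)
a_true, b_true = 2.0, 0.7
y = a_true * np.exp(b_true * X) * rng.lognormal(mean=0.0, sigma=0.05, size=X.size)

# Unweighted fit of a straight line to log(y)
coef = np.polyfit(X, np.log(y), 1)
a_fit, b_fit = np.exp(coef[1]), coef[0]

# Weighted fit, giving more importance to the larger y values
coef_w = np.polyfit(X, np.log(y), 1, w=np.sqrt(y))
a_w, b_w = np.exp(coef_w[1]), coef_w[0]

print("unweighted:", a_fit, b_fit)
print("weighted:  ", a_w, b_w)
```

Both fits should land close to the true parameters here. With real data the two can differ noticeably, because the unweighted fit treats errors in log(y) equally for small and large values.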

To compute the R-squared of our exponential curve, we can use r2_score from scikit-learn, as follows:
y_pred = np.exp(coef[1])*np.exp(coef[0]*X)

from sklearn.metrics import r2_score

r2s = r2_score(y, y_pred)
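
If scikit-learn is not installed, the same quantity can be computed directly with numpy as 1 - SS_res/SS_tot. The sketch below assumes small made-up arrays for X and y. Note that the fit is done on log(y), while R-squared is evaluated here on the original scale of y.

```python
import numpy as np

# Hypothetical data, roughly following an exponential trend
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.6, 7.9, 19.5, 56.0])

# Exponential fit through a straight-line fit on log(y)
coef = np.polyfit(X, np.log(y), 1)
y_pred = np.exp(coef[1]) * np.exp(coef[0] * X)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(r2)
```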

Wednesday, June 7, 2017

Python and Pandas - How to plot Multiple Curves with 5 Lines of Code

In this post I will show how to use pandas to make a minimalist but pretty line chart, with as many curves as we want.

In this case I will use an I-D-F (intensity-duration-frequency) precipitation table, with rows corresponding to return periods (years) and columns corresponding to durations in minutes.


For the code to work properly, the table must have headers in both its columns and its rows, and the first cell must be blank. Select the table in your spreadsheet editor and copy it to the clipboard.

Then, run the following code:


import pandas as pd

table = pd.read_clipboard()
# convert_objects was removed from pandas; pd.to_numeric does the same conversion
tabTr = table.transpose().apply(pd.to_numeric, errors='coerce')
eixox = tabTr.index.values.astype(float)
tabTr.set_index(eixox).plot(grid=True)

And voilà!
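
The clipboard step is hard to reproduce in a script, so the sketch below builds an equivalent table from tab-separated text instead. The intensity values are invented for illustration only and do not come from any real I-D-F table. The plot call is left commented out so the example runs without matplotlib.

```python
import io
import pandas as pd

# Hypothetical I-D-F intensities (mm/h), invented for illustration only.
# Rows: return periods in years. Columns: durations in minutes.
# The first header cell is blank, as in the clipboard workflow.
text = (
    "\t5\t10\t30\t60\n"
    "2\t120.0\t95.0\t55.0\t35.0\n"
    "10\t160.0\t125.0\t75.0\t48.0\n"
    "50\t200.0\t155.0\t95.0\t60.0\n"
)

# Use the first column (return periods) as the row labels
table = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)

# Transpose so that each return period becomes one curve over duration
curves = table.transpose()
curves.index = curves.index.astype(float)  # durations as numbers for the x axis

# curves.plot(grid=True)  # uncomment to draw the chart (needs matplotlib)
print(curves)
```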


Friday, June 2, 2017

What is PANDAS? - Pandas in Hydrology

As stated in Wikipedia:
"...
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for multidimensional structured data sets....
"

Pandas is a library that can easily deal with datasets and, together with numpy and scipy, can solve a great number of hydrology and hydraulics problems.

Pandas can easily read text/csv files, and can categorize and operate on their data with a few lines of code.

First, we always have to import the pandas library with:

import pandas as pd



To read a csv time series of daily precipitation data, we can write:

dataSeries = pd.read_csv('csvfile.csv', index_col=0, parse_dates=True)


This works if the index column is the first one and it contains dates in a standard format.



To get the average and standard deviation of each column, just write:

m1,d1 = dataSeries.mean(), dataSeries.std()


And to make an easy and beautiful histogram of this data, just write:

dataSeries.hist()
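
Putting these steps together, here is a minimal sketch that runs without a csv file on disk. The column name precip_mm and the four daily values are hypothetical, and the monthly total at the end is an extra step added as an example of a typical hydrology operation.

```python
import io
import pandas as pd

# Hypothetical daily precipitation data (mm), standing in for 'csvfile.csv'
csv_text = (
    "date,precip_mm\n"
    "2017-01-01,0.0\n"
    "2017-01-02,12.4\n"
    "2017-01-03,3.1\n"
    "2017-02-01,7.5\n"
)

# First column is the index and holds the dates
data = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Average and standard deviation of the precipitation column
mean = data["precip_mm"].mean()
std = data["precip_mm"].std()

# Monthly totals, labelled by the first day of each month
monthly = data["precip_mm"].resample("MS").sum()
print(mean, std)
print(monthly)
```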


The pandas documentation is available at: http://pandas.pydata.org/pandas-docs/stable/install.html


Happy analyzing!