Python in Hydrology and Hydraulics: pandas

Showing posts with label pandas. Show all posts

Thursday, August 11, 2022

Pandas- Creating a blank dataframe between two dates

import pandas as pd

dtmin,dtmax=pd.Timestamp("2022-08-10 00:00"),pd.Timestamp("2022-08-11 00:00")

df00=pd.DataFrame(index=pd.date_range(dtmin,dtmax,freq='10min'),columns=['value'])

Frequency can be: '1d', '10min', '30min', among others in the right format.

Tuesday, April 20, 2021

Pandas - Reading CSV file with variable number of columns

If the CSV/ text file has different number of columns along the rows, it will fail reading with simple "pd.read_csv(file)".

Instead of it, try naming the columns. It will read until the number of columns supplied.

For example:

df0=pd.read_csv(filename, sep=';',names=['a', 'b', 'c', 'd', 'e'])

It will read only until 5 columns wide.

Monday, March 15, 2021

Return values that meet some criteria based in other columns - Pandas

Some simple tasks are much faster and simpler in pandas than in Excel.

For example: return values that meet some criteria based in other columns.

If we want to list all the basins with area greater than 100ha, a simple code will do.

Assuming we had copied this table from excel:

import pandas as pd

df0=pd.read_clipboard()

df0[df0.iloc[:,-1]>100].to_clipboard()

This code will put the result in the clipboard, to paste back into Excel (for ex.).

Wednesday, July 8, 2020

Downloading table frow website to Pandas Dataframe - html to pandas

In some cases it is possible to download only with pandas command:

pd.read_html(url)

If the command returns with the html response "forbidden" we can use the requests library, sending some header information to prevent this error.

Example: downloading world population data from site www.worldometers.info:

import requests

import pandas as pd

url = r'https://www.worldometers.info/world-population/population-by-country/'

header = {

"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",

"X-Requested-With": "XMLHttpRequest"

}

r = requests.get(url, headers=header)

#world population - first table of the website

dfpop=pd.read_html(r.text)[0]

Monday, October 21, 2019

Delete row/ column in DataFrame with Null/ NaN values

if df0 is a Pandas DataFrame with null Values:

df0=df0.dropna(axis=0,how='all') - will remove rows that have only 'NaN' values

df0=df0.dropna(axis=1,how='all') - will remove columns that have only 'NaN' values

Thursday, June 13, 2019

Find Maximum Values by year in timeseries dataframe, keeping the date

df2=df1.ix[df1.groupby(df1.index.year).idxmax().iloc[:,0]]

Tuesday, May 28, 2019

Find common timespan/ years in multiple time series/ dataframes

#concatenate vertically all dataframes
dfAlldf=pd.concat([df1, df2,df3,df4], axis=1)

#sort them by the date (datetimeindex)

dfAlldf=dfAlldf.sort_index()

#group dataframe by years

grps=dfAlldf.groupby(dfAlldf.index.year)

#empty dataframe for populating with complete years

dfCompl=pd.DataFrame()

# for each group of years

for g in grps:

#if don't have any null values in year, in any column

if not any(g[1].isnull().any(axis=1)):

#concatenate in dfCompl

dfCompl=pd.concat([dfCompl, g[1]], axis=0)

# re-sort by index

dfCompl=dfCompl.sort_index()

Thursday, April 25, 2019

Pandas - Reading headers and dates correctly from Clipboard/ CSV

When using pandas funcions read_clipboard() or read_csv() you have to define if your data has headers (column headers) and indexes (row headers).

If you're passing indexes with datetime format, make sure if it will be parsed correctly, indicating it's a datetime and if it has dayfirst format (dd/mm/YYYY).

For example:

pd.read_clipboard(index_col=0, headers=None,parse_dates=True, dayfirst=True)

Is telling pandas that the table in clipboard has no column headers, but have index (row headers) in the first column and it is in datetime format with day first (dd/mm/YYYY).

Wednesday, January 23, 2019

Interpolate missing values in pandas DataFrame

If we have a dataframe with dates and flows - with missing values, as example below:

0
2019-01-31 50.208308
2019-02-28 50.623457
2019-03-31 56.203933
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 117.727655
2019-07-31 62.273259
2019-08-31 49.054898
2019-09-30 55.612575
2019-10-31 54.187409

We can use the function pandas interpolate, and interpolate the data with different methods

dfIn.interpolate() - will fill noData with linear interpolation;
dfIn.interpolate(method='polynomial', order=3) - will fill noData with 3rd degree polinomial interpolation;

Result:
linear polinomial original
2019-01-31 50.208308 50.208308 50.208308
2019-02-28 50.623457 50.623457 50.623457
2019-03-31 56.203933 56.203933 56.203933
2019-04-30 76.711840 89.513986 NaN
2019-05-31 97.219748 124.233259 NaN
2019-06-30 117.727655 117.727655 117.727655
2019-07-31 62.273259 62.273259 62.273259
2019-08-31 49.054898 49.054898 49.054898
2019-09-30 55.612575 55.612575 55.612575
2019-10-31 54.187409 54.187409 54.187409

Wednesday, November 14, 2018

Pandas - Select only numeric columns

df=df.select_dtypes(include=['float64'])

will return only the float values of the 'df' dataframe.

Tuesday, September 4, 2018

Excel to Python Pandas DataFrame

The most easy way is to use the clipboard. Just copy your table in Excel to the clipboard, and then call the read_clipboard method from pandas:

df1 = pd.read_clipboard()

Pandas can read Excel files directly too, with read_excel

df1 = pd.read_excel('tmp.xlsx',index_col=0, header=0)

Taking care with header and indexes, and other options as described here.

Tuesday, August 21, 2018

Maximum, minimum and average monthly precipitation

Below is shown some panda commands for retrieving maximum, minimum and average monthly precipitation from daily precipitation data.

The daily precipitation is assumed to be in a pandas DataFrame, with its index in Datetime index format.

1 - Daily to monthly precipitation
df_m=df1.resample('M').sum()

2 - Maximum monthly precipitation
p_max=df_m.groupby(df_m.index.month).max()

3 - Minimum monthly precipitation
p_min=df_m.groupby(df_m.index.month).min()

4 - Average monthly precipitation
p_avg=df_m.groupby(df_m.index.month).mean()

Tuesday, July 10, 2018

Fitting IDF curves with Scipy and Pandas

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 10 11:19:37 2018

@author: rodrigo.goncalves
"""
import pandas as pd
from scipy.optimize import minimize
 
# This is the IDF function, returning the sum of squared errors (SSE)
def func2(par, res):
    p1 =  (par[0] * res.index.values  **par[1])
    p2 = ((res.columns.values.astype(float)+par[2])**par[3])
    f = pd.DataFrame([p/p2 for p in p1],index=res.index,columns=res.columns)
    erroTotQ=((f-res)**2).sum(axis=1).sum()    
    return erroTotQ
 
# copy your rainfall intensities table from excel
# with column headers with rainfall durations
# and row names with Return period value in years
dfInt=pd.read_clipboard()

#initial guess
param1 = [5000, 0.1, 10, 0.9]

res2 = minimize(func2, param1, args=(dfInt,), method='Nelder-Mead')
print(res2)
cs=['K=','a=','b=','c=']
dfResult=pd.DataFrame(res2.x,index=cs).transpose()
print(dfResult)
dfResult.to_clipboard(index=None)

Wednesday, May 30, 2018

Pandas - Operations between rows - distance between 2 points

If we have a table with a column with xy coordinates, for example:

We can get the difference between consecutive rows by using Pandas SHIFT function on columns.
".shift(-1)" will roll the rows 1 position backwards, and ".shift(1)" or simply ".shift()" will roll down your column by 1 position of the rows.

In our example, df1['x'].shift() will return:

0 NaN
1 455395.996360
2 527627.076641
3 536278.269190
4 553932.441097
5 568699.553239
6 569709.130272
7 573016.302437
8 575141.096777
9 580107.934566

if we want to calculate the euclidean distance between consecutive points, we can use the shift associated with numpy functions numpy.sqrt and numpy.power as following:

df1['diff']= np.sqrt(np.power(df1['x'].shift()-df1['x'],2)+
np.power(df1['y'].shift()-df1['y'],2))

Resulting in:

0 NaN
1 89911.101224
2 21323.016099
3 204394.524574
4 37767.197793
5 46692.771398
6 13246.254235
7 2641.201366
8 15153.187527
9 15853.974422

Friday, April 13, 2018

IDF table from equation - Pandas

How to fill a table (pandas DataFrame) with rainfall intensity values, extracted from idf equation:

import pandas as pd

durs = [5,10,15,20,30,60,120,180,360,540,720,1440] # rainfall durations
rps=[5,10,25,50,100] # return periods

t1 = pd.DataFrame(index=rps,columns=durs) # initialize DataFrame with rp's and durations

cf = [942.76,.1242,17.0,.650] #example idf equation coefficients

table = t1.apply(lambda x: (cf[0]*x.index**cf[1])/((x.name+cf[2])**cf[3]))

Thursday, February 8, 2018

Daily precipitation data to monthly and yearly

With pandas, we can group datetime values according to datetime units, as months and years.

If we have a pandas dataframe with 2 columns - DATE and VALUE

Certify that the column DATE is recognized as datetime64 format - you can use, for example:

df1['DATE'] = pd.to_datetime(df1['DATE'])

and then you can group and make operations on this group.

Getting the average of total monthly precipitation:

monthly = df1.groupby(df1['DATE'].dt.month).sum() / len(pd.unique(df1['DATE'].dt.year))

(groups by month sum and then divide by the number of years)

Getting the average yearly precipitation:

PAnual = df1.groupby(df1[0].dt.year).sum().mean()

Friday, January 26, 2018

Accumulating Precipitation Data

If we have a pandas dataframe named df1 with a column '15min' containing 15 minute precipitation data, we can easily accumulate for other durations, using the rolling method as shown in the example below:

df1['01 h'] =df1['15min'].rolling(window=4,center=False).sum()

Note that the window=4 parameter means that it will accumulate 4 lines of 15 minute data, resulting in 1 hour precipitation. This parameter can be changed to whatever duration you want.

Sunday, November 26, 2017

Pandas sum column values according to another columns value

One-liner code to sum Pandas second columns according to same values in the first column.

df2 = df1.groupby(df1.columns[0])[df1.columns[1]].sum().reset_index()

For example, applying to a table listing pipe diameters and lenghts, the command will return total lenghts according to each unique diameters.

This functionality is similar to excel's pivot table sum.

Friday, June 30, 2017

Simple code to generate synthetic time series data in Python / Pandas

Here is a simple code to generate synthetic time series.

import numpy as np
import pandas as pd

med = 15.5
dp = 8.2
sDays = np.arange('2001-01', '2016-12', dtype='datetime64[D]')
nDays = len(sDays)

s1 = np.random.gumbel(loc=med,scale=dp,size=nDays)
s1[s1 < 0] = 0

dfSint = pd.DataFrame({'Q':s1},index=sDays)
dfSint.plot()

Saturday, June 24, 2017

Pandas - How to read text files delimited with fixed widths

With Python Pandas library it is possible to easily read fixed width text files, for example:

In this case, the text file has its first 4 lines without data and the 5th line with the header. The header and the data are delimeted with fixed char widths, being the widths sizes as following:

12 spaces , 10 spaces ,6 spaces ,9 spaces ,7 spaces,7 spaces ,7 spaces ,4 spaces

The following code will read the file as a pandas DataFrame, and also parse the dates in the datetime format:

import pandas as pd

ds2 = pd.read_fwf('yourtextfile.txt', widths=[12,10,6,9,7,7,7,4], skiprows=4, parse_dates=True)