This simple script computes some basic descriptive statistics (mean, standard deviation, kurtosis, etc.) on a data column imported from a CSV file using pandas. In addition, the script accepts the argument --exclude-zeros and then computes the same statistics with zeros excluded. The script delivers the desired results. However, as I come from an R background, I would be happy to receive feedback on a proper / pythonic way of generating the desired results.
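To illustrate the flag in isolation, here is a minimal sketch of how argparse handles an option spelled --exclude-zeros: the hyphen is converted to an underscore, so the value ends up in the attribute exclude_zeros. The helper name build_parser and the file names stuff.csv and areas.csv are placeholders assumed for the example, and the input option is simplified to a plain string path.

```python
import argparse


def build_parser():
    """Build a small parser with an input path and a zero-exclusion flag."""
    parser = argparse.ArgumentParser(
        description='Calculate basic geography statistics.')
    # Without nargs the value is stored as a single string, not a list.
    # 'stuff.csv' is a placeholder default, not a real data file.
    parser.add_argument('-i', '--infile', default='stuff.csv',
                        help='Path to data file with geography statistics.')
    # store_true defaults to False; the dest is derived as 'exclude_zeros'.
    parser.add_argument('--exclude-zeros', action='store_true')
    return parser


parser = build_parser()
defaults = parser.parse_args([])
custom = parser.parse_args(['-i', 'areas.csv', '--exclude-zeros'])

print(defaults.infile, defaults.exclude_zeros)
print(custom.infile, custom.exclude_zeros)

# For comparison: nargs=1 wraps the supplied value in a one-element list.
list_parser = argparse.ArgumentParser()
list_parser.add_argument('-i', '--infile', nargs=1)
wrapped = list_parser.parse_args(['-i', 'areas.csv'])
print(wrapped.infile)
```

Passing an explicit argument list to parse_args makes the behaviour easy to try out without touching the command line.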
Data
The data pertains to the geographic area sizes of neighbourhood geographies for Scotland and is publicly available. This and other similar data sets can be sourced from the Scottish Government open data portal.
#!/Users/me/path/path/path/bin/python
"""DZ Area check

The script uses a previously used area size file and produces
some descriptive statistics. The script additionally computes statistics
excluding zeros.
"""

# Modules
# Refresh requirements creation:
# $ pipreqs --force ~/where/this/stuff/sits/
import os
import argparse
import pandas as pd
from tabulate import tabulate
import numpy as np


# Main function running the program
def main(csv_data, exclude):
    """Compute the desired area statistics"""
    data = pd.read_csv(
        filepath_or_buffer=csv_data,
        skiprows=7,
        encoding='utf-8',
        header=None,
        names=['datazone', 'usual_residenrs', 'area_hectares'])
    print('\nSourced table:\r')
    print(tabulate(data.head(), headers='keys', tablefmt='psql'))
    # Replace zero if required
    if exclude:
        data = data.replace(0, np.NaN)
    # Compute statistics
    area_mean = data.loc[:, "area_hectares"].mean()
    area_max = data.loc[:, "area_hectares"].max()
    area_min = data.loc[:, "area_hectares"].min()
    area_total = data.loc[:, "area_hectares"].sum()
    obs_count = data.loc[:, "area_hectares"].count()
    obs_dist = data.loc[:, "area_hectares"].nunique(
    )  # Count distinct observations
    area_variance = data.loc[:, "area_hectares"].var()
    area_median = data.loc[:, "area_hectares"].median()
    area_std = data.loc[:, "area_hectares"].std()
    area_skw = data.loc[:, "area_hectares"].skew()
    area_kurt = data.loc[:, "area_hectares"].kurtosis()
    # Create results object
    results = {
        'Statistic': [
            'Average', 'Max', 'Min', 'Total', 'Count', 'Count (distinct)',
            'Variance', 'Median', 'SD', 'Skewness', 'Kurtosis'
        ],
        'Value': [
            area_mean, area_max, area_min, area_total, obs_count, obs_dist,
            area_variance, area_median, area_std, area_skw, area_kurt
        ]
    }
    # Show results object
    print('\nArea statistics:\r')
    print(
        tabulate(
            results,
            headers='keys',
            tablefmt='psql',
            numalign='left',
            floatfmt='.2f'))
    return (results)


# Import arguments. Solves running program as a module and as a standalone
# file.
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Calculate basic geography statistics.',
        epilog='Data Zone Area Statistics\rKonrad')
    parser.add_argument(
        '-i',
        '--infile',
        nargs=1,
        type=argparse.FileType('r'),
        help='Path to data file with geography statistics.',
        default=os.path.join('/Users', 'me', 'folder', 'data', 'folder',
                             'import_folder', 'stuff.csv'))
    parser.add_argument(
        '--exclude-zeros',
        dest='exclude_zeros',
        action='store_true',
        default=False)
    args = parser.parse_args()
    # Call main function and compute stats
    main(csv_data=args.infile, exclude=args.exclude_zeros)
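For comparison, the statistics step can probably be written more compactly by selecting the column once and passing a list of method names to Series.agg. The sketch below is only one possible shape: the names STATISTICS and area_statistics and the sample values are made up for illustration. It also filters zeros out of the area column directly instead of replacing them with NaN across the whole frame, which should give the same figures because pandas skips NaN in these statistics by default.

```python
import pandas as pd

# Method names understood by Series.agg, in the order of the results table.
STATISTICS = ['mean', 'max', 'min', 'sum', 'count', 'nunique',
              'var', 'median', 'std', 'skew', 'kurt']


def area_statistics(area, exclude_zeros=False):
    """Return the eleven statistics for one numeric column as a Series."""
    if exclude_zeros:
        # Boolean indexing keeps only the non-zero observations.
        area = area[area != 0]
    return area.agg(STATISTICS)


# Made-up sample values standing in for the area_hectares column.
sample = pd.Series([0.0, 1.0, 2.0, 3.0, 3.0, 10.0], name='area_hectares')
with_zeros = area_statistics(sample)
without_zeros = area_statistics(sample, exclude_zeros=True)

# Side-by-side view of both variants.
print(pd.DataFrame({'all': with_zeros, 'no zeros': without_zeros}))
```

Returning a Series keeps the labels and values together, so the result could be passed on for tabulation or further processing without building two parallel lists.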
Results
Sourced table:
+----+------------+-------------------+-----------------+
|    | datazone   |   usual_residenrs |   area_hectares |
|----+------------+-------------------+-----------------|
|  0 | S01000001  |               872 |          438.88 |
|  1 | S01000002  |               678 |           30.77 |
|  2 | S01000003  |               788 |           13.36 |
|  3 | S01000004  |               612 |           20.08 |
|  4 | S01000005  |               643 |           27.02 |
+----+------------+-------------------+-----------------+

Area statistics:
+------------------+-------------+
| Statistic        | Value       |
|------------------+-------------|
| Average          | 1198.11     |
| Max              | 116251.04   |
| Min              | 0.00        |
| Total            | 7793711.31  |
| Count            | 6505.00     |
| Count (distinct) | 4200.00     |
| Variance         | 35231279.23 |
| Median           | 22.00       |
| SD               | 5935.59     |
| Skewness         | 9.77        |
| Kurtosis         | 121.59      |
+------------------+-------------+
Results (excluding zeros)
Sourced table:
+----+------------+-------------------+-----------------+
|    | datazone   |   usual_residenrs |   area_hectares |
|----+------------+-------------------+-----------------|
|  0 | S01000001  |               872 |          438.88 |
|  1 | S01000002  |               678 |           30.77 |
|  2 | S01000003  |               788 |           13.36 |
|  3 | S01000004  |               612 |           20.08 |
|  4 | S01000005  |               643 |           27.02 |
+----+------------+-------------------+-----------------+

Area statistics:
+------------------+-------------+
| Statistic        | Value       |
|------------------+-------------|
| Average          | 1199.03     |
| Max              | 116251.04   |
| Min              | 1.24        |
| Total            | 7793711.31  |
| Count            | 6500.00     |
| Count (distinct) | 4199.00     |
| Variance         | 35257279.16 |
| Median           | 22.01       |
| SD               | 5937.78     |
| Skewness         | 9.77        |
| Kurtosis         | 121.49      |
+------------------+-------------+
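As a quick consistency check on the two result tables, the reported averages should equal total divided by count, and the reported SD should equal the square root of the variance. The sketch below uses only figures copied from the tables; the tolerance of 0.005 is an assumption reflecting the two-decimal rounding of the output.

```python
import math

# Figures copied from the two result tables above.
runs = {
    'all rows': {'total': 7793711.31, 'count': 6505, 'mean': 1198.11,
                 'variance': 35231279.23, 'sd': 5935.59},
    'zeros excluded': {'total': 7793711.31, 'count': 6500, 'mean': 1199.03,
                       'variance': 35257279.16, 'sd': 5937.78},
}

TOLERANCE = 0.005  # half a unit in the second decimal place

checks = {}
for label, run in runs.items():
    # Mean recomputed from total and count.
    mean_ok = math.isclose(run['total'] / run['count'], run['mean'],
                           abs_tol=TOLERANCE)
    # Standard deviation recomputed from the variance.
    sd_ok = math.isclose(math.sqrt(run['variance']), run['sd'],
                         abs_tol=TOLERANCE)
    checks[label] = (mean_ok, sd_ok)
    print(label, 'mean consistent:', mean_ok, 'SD consistent:', sd_ok)

# Number of zero-valued rows implied by the two counts.
zero_rows = runs['all rows']['count'] - runs['zeros excluded']['count']
print('rows dropped:', zero_rows)
```

The unchanged total is also expected, since zeros contribute nothing to the sum, and the distinct count falls by exactly one because zero is a single distinct value.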