
Python program computing some statistics on Scottish geographic areas

This simple script computes some basic descriptive statistics (mean, standard deviation, kurtosis, etc.) on a data column imported from a CSV file using pandas. In addition, the script accepts the argument --exclude-zeros and, when it is given, computes the statistics excluding zeros. The script delivers the desired results. However, as I come from an R background, I would be happy to receive feedback on a proper / pythonic way of generating them.
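As a minimal sketch of how such a flag behaves, assuming only the standard-library argparse module: a store_true option is False unless it is passed, and the hyphenated option --exclude-zeros is exposed as the attribute exclude_zeros. The build_parser helper and the explicit argument lists are hypothetical and serve only as illustration.

```python
import argparse


def build_parser():
    """Build a parser with a single boolean flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description='Calculate basic geography statistics.')
    # store_true: the value is False unless the flag is given.
    parser.add_argument('--exclude-zeros', dest='exclude_zeros',
                        action='store_true', default=False)
    return parser


# Parse explicit argument lists instead of sys.argv so the sketch is
# reproducible when run directly.
without_flag = build_parser().parse_args([])
with_flag = build_parser().parse_args(['--exclude-zeros'])
print(without_flag.exclude_zeros, with_flag.exclude_zeros)
```

Passing a list to parse_args is also a convenient way to exercise the option handling without touching the command line.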

Data

The data pertains to the area sizes of neighbourhood geographies in Scotland and is publicly available. This and similar data sets can be sourced from the Scottish Government open data portal.

#!/Users/me/path/path/path/bin/python
"""DZ Area check

The script reads a previously used area size file and produces
some descriptive statistics. The script additionally computes statistics
excluding zeros.
"""

# Modules
# Refresh requirements creation:
# $ pipreqs --force ~/where/this/stuff/sits/
import os
import argparse
import pandas as pd
from tabulate import tabulate
import numpy as np


# Main function running the program
def main(csv_data, exclude):
    """Compute the desired area statistics"""
    data = pd.read_csv(
        filepath_or_buffer=csv_data,
        skiprows=7,
        encoding='utf-8',
        header=None,
        names=['datazone', 'usual_residenrs', 'area_hectares'])
    print('\nSourced table:\r')
    print(tabulate(data.head(), headers='keys', tablefmt='psql'))
    # Replace zero if required
    if exclude:
        data = data.replace(0, np.nan)
    # Compute statistics
    area_mean = data.loc[:, "area_hectares"].mean()
    area_max = data.loc[:, "area_hectares"].max()
    area_min = data.loc[:, "area_hectares"].min()
    area_total = data.loc[:, "area_hectares"].sum()
    obs_count = data.loc[:, "area_hectares"].count()
    obs_dist = data.loc[:, "area_hectares"].nunique()  # Count distinct observations
    area_variance = data.loc[:, "area_hectares"].var()
    area_median = data.loc[:, "area_hectares"].median()
    area_std = data.loc[:, "area_hectares"].std()
    area_skw = data.loc[:, "area_hectares"].skew()
    area_kurt = data.loc[:, "area_hectares"].kurtosis()
    # Create results object
    results = {
        'Statistic': [
            'Average', 'Max', 'Min', 'Total', 'Count', 'Count (distinct)',
            'Variance', 'Median', 'SD', 'Skewness', 'Kurtosis'
        ],
        'Value': [
            area_mean, area_max, area_min, area_total, obs_count, obs_dist,
            area_variance, area_median, area_std, area_skw, area_kurt
        ]
    }
    # Show results object
    print('\nArea statistics:\r')
    print(
        tabulate(
            results,
            headers='keys',
            tablefmt='psql',
            numalign='left',
            floatfmt='.2f'))
    return results


# Import arguments. Solves running program as a module and as a standalone
# file.
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Calculate basic geography statistics.',
        epilog='Data Zone Area Statistics\rKonrad')
    parser.add_argument(
        '-i',
        '--infile',
        nargs=1,
        type=argparse.FileType('r'),
        help='Path to data file with geography statistics.',
        default=os.path.join('/Users', 'me', 'folder', 'data', 'folder',
                             'import_folder', 'stuff.csv'))
    parser.add_argument(
        '--exclude-zeros',
        dest='exclude_zeros',
        action='store_true',
        default=False)
    args = parser.parse_args()
    # Call main function and compute stats
    main(csv_data=args.infile, exclude=args.exclude_zeros)
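For comparison, the repeated data.loc[:, "area_hectares"] calls could probably be collapsed into a single Series.agg call. The following is only a sketch of one possible alternative, not the script's actual implementation: the area_statistics helper and the sample values are made up for illustration, and the result is labelled with pandas method names rather than the display names used in the script.

```python
import numpy as np
import pandas as pd


def area_statistics(areas, exclude_zeros=False):
    """Return descriptive statistics for a Series of area sizes."""
    if exclude_zeros:
        # Zeros become NaN, which pandas reductions skip by default.
        areas = areas.replace(0, np.nan)
    # Each string names a Series method; agg returns a labelled Series.
    return areas.agg(['mean', 'max', 'min', 'sum', 'count', 'nunique',
                      'var', 'median', 'std', 'skew', 'kurt'])


# Made-up values for illustration; not taken from the Scottish data.
sample = pd.Series([0.0, 2.0, 4.0, 4.0, 10.0], name='area_hectares')
print(area_statistics(sample))
print(area_statistics(sample, exclude_zeros=True))
```

One design difference worth noting: the script replaces zeros across the whole data frame, whereas this sketch only touches the area values, so zeros in other columns would be left alone.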

Results

Sourced table:
+----+------------+-------------------+-----------------+
|    | datazone   |   usual_residenrs |   area_hectares |
|----+------------+-------------------+-----------------|
|  0 | S01000001  |               872 |          438.88 |
|  1 | S01000002  |               678 |           30.77 |
|  2 | S01000003  |               788 |           13.36 |
|  3 | S01000004  |               612 |           20.08 |
|  4 | S01000005  |               643 |           27.02 |
+----+------------+-------------------+-----------------+

Area statistics:
+------------------+-------------+
| Statistic        | Value       |
|------------------+-------------|
| Average          | 1198.11     |
| Max              | 116251.04   |
| Min              | 0.00        |
| Total            | 7793711.31  |
| Count            | 6505.00     |
| Count (distinct) | 4200.00     |
| Variance         | 35231279.23 |
| Median           | 22.00       |
| SD               | 5935.59     |
| Skewness         | 9.77        |
| Kurtosis         | 121.59      |
+------------------+-------------+

Results (excluding zeros)

Sourced table:
+----+------------+-------------------+-----------------+
|    | datazone   |   usual_residenrs |   area_hectares |
|----+------------+-------------------+-----------------|
|  0 | S01000001  |               872 |          438.88 |
|  1 | S01000002  |               678 |           30.77 |
|  2 | S01000003  |               788 |           13.36 |
|  3 | S01000004  |               612 |           20.08 |
|  4 | S01000005  |               643 |           27.02 |
+----+------------+-------------------+-----------------+

Area statistics:
+------------------+-------------+
| Statistic        | Value       |
|------------------+-------------|
| Average          | 1199.03     |
| Max              | 116251.04   |
| Min              | 1.24        |
| Total            | 7793711.31  |
| Count            | 6500.00     |
| Count (distinct) | 4199.00     |
| Variance         | 35257279.16 |
| Median           | 22.01       |
| SD               | 5937.78     |
| Skewness         | 9.77        |
| Kurtosis         | 121.49      |
+------------------+-------------+
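A quick way to sanity-check the two tables is that the average should equal the total divided by the count. Excluding zeros leaves the total unchanged and lowers the count from 6505 to 6500, so the average rises slightly. A small sketch using only the figures printed above (the variable names are illustrative):

```python
# Figures copied from the two results tables above.
total = 7793711.31
count_all = 6505
count_nonzero = 6500

# Mean = total / count, rounded to two decimals like the tables.
mean_all = round(total / count_all, 2)
mean_nonzero = round(total / count_nonzero, 2)
print(mean_all, mean_nonzero)
```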
