python - Leading or trailing whitespace and pandas value_counts vs boolean selection
I am working with a dataframe created from a CSV file downloaded from my county's sheriff's department. The data is located here, and can be read in using read_csv(). The dataframe contains information about incidents reported to, and acted upon by, the sheriff. One of the columns is the city in which the incident occurred, and I'm trying to create a table and graph showing the change in the number of incidents for my area (larkfield) over time.
When I use pandas' value_counts function with "city" as the input, I get
    In [86]: compcounts = soco['city'].value_counts()
    In [96]: compcounts[0:10]
    Out[96]:
    santa rosa           55291
    windsor              31711
    sonoma               28840
    guerneville           9309
    boyes hot springs     8006
    petaluma              6103
    el verano             5969
    geyserville           5822
    larkfield             5398
    forestville           5312
    dtype: int64
There are 5398 reports for my area ('larkfield'). When I try to take a subset of the dataframe for my area, using

    larkfieldcomps = soco[soco['city'] == "larkfield"]

it returns only 115 values, not 5398:
    In [94]: larkcounts = larkfieldcomps['year'].value_counts()
    In [95]: larkcounts
    Out[95]:
    2015    114
    2013      1
    dtype: int64
I thought maybe the problem was that in some entries there are one or more spaces before or after "larkfield" in the incident description, so I did a search/replace to try to strip out the spaces, but I still get only 115 values when searching for "larkfield", even though I know there are many more incidents in my area.
This is my first question on Stack Overflow ... I've researched this to death but haven't come up with an answer yet. Any suggestions are appreciated.
I can explain this after downloading the data (and reading it into a dataframe with read_csv using the default settings). It appears that there are leading or trailing spaces in there. Apparently value_counts is smart enough to ignore them when adding things up, while boolean selection is more literal.
    >>> soco[soco['city'] == "larkfield"].city.count()
    122
    >>> soco['city2'] = soco.city.str.strip()
    >>> soco[soco['city2'] == "larkfield"].city.count()
    5520
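The same comparison can be reproduced on a small made-up frame. This is only a sketch: the rows, the amount of padding, and the 'year' values are invented for illustration and are not taken from the sheriff's file.

```python
import pandas as pd

# Toy stand-in for the real data; values and counts are made up.
soco = pd.DataFrame({
    "city": ["larkfield", "larkfield   ", "larkfield   ", "windsor"],
    "year": [2013, 2015, 2015, 2015],
})

# An exact comparison only matches the value with no padding.
exact = soco[soco["city"] == "larkfield"]

# Strip leading/trailing whitespace once, then select on the cleaned column.
soco["city"] = soco["city"].str.strip()
cleaned = soco[soco["city"] == "larkfield"]

# Incidents per year for the cleaned selection, in chronological order.
by_year = cleaned["year"].value_counts().sort_index()

print(len(exact), len(cleaned))
print(by_year)
```

Overwriting the column (rather than adding 'city2') is a design choice made here so that every later selection sees the cleaned values; keeping a separate column, as in the session above, works just as well.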
And when I look a little closer, it seems that 5398 of them have 11 trailing spaces and 122 have no spaces. That's the difference. (I'm not sure why you find 115 values for year instead of 122, but maybe that's due to missing values in year, depending on how you created it.)
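One way to look "a little closer" is to compare string lengths and print the repr of the distinct values, since printed output hides trailing spaces. A minimal sketch on invented values (the 11-space padding mirrors the description above; the row counts do not):

```python
import pandas as pd

# Made-up values: one unpadded entry and two padded with 11 trailing spaces.
cities = pd.Series(["larkfield", "larkfield" + " " * 11, "larkfield" + " " * 11])

# String lengths expose padding that is invisible when the values are printed.
lengths = cities.str.len().value_counts()

# repr() puts quotes around each value, so the spaces become visible.
distinct = [repr(c) for c in cities.unique()]

# Number of entries that change when stripped, i.e. that carry padding.
padded = int((cities != cities.str.strip()).sum())

print(lengths)
print(distinct)
print(padded)
```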
But I did double-check the behavior of value_counts, because I had been assuming that leading and trailing spaces matter.
    >>> pd.Series([' foo', 'foo', 'foo ']).value_counts()
     foo    1
    foo     1
    foo     1
And, yeah, in this simple example leading and trailing blanks do indeed matter. So why don't they in the 'soco' dataframe???
So there are still some loose ends here, but this is a start toward figuring out what is happening.
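One reading that would tie up the loose end, offered as an assumption rather than something checked against the real file: value_counts may not be ignoring whitespace in 'soco' at all. The 5398 figure matches the number of entries with 11 trailing spaces, so the row printed as larkfield in the value_counts output could simply be the padded value, with the 122 unpadded entries listed separately further down. The sketch below uses invented counts to show that the two keys stay distinct even though they print almost identically.

```python
import pandas as pd

# Made-up counts: three padded entries and one unpadded entry.
cities = pd.Series(["larkfield" + " " * 11] * 3 + ["larkfield"])

counts = cities.value_counts()

# Two distinct keys; the list form shows the trailing spaces explicitly.
print(counts.index.tolist())

# After stripping, the two keys collapse into one.
stripped_counts = cities.str.strip().value_counts()
print(stripped_counts)
```

Checking soco['city'].value_counts().index.tolist() on the real data would confirm or rule out this reading.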