python - Leading or trailing whitespace and pandas value_counts vs boolean selection -


I am working with a dataframe created from a CSV file downloaded from my county's sheriff's department. The data is located here, and can be read in using read_csv(). The dataframe contains information on incidents reported to, and acted upon by, the sheriff. One of the columns is the city in which the incident occurred, and I'm trying to create a table and graph showing the change in the number of incidents for one area (larkfield) over time.

When I use pandas' value_counts function with the "city" column as input, I get

    In [86]: compcounts = soco['city'].value_counts()
    In [96]: compcounts[0:10]
    Out[96]:
    santa rosa              55291
    windsor                 31711
    sonoma                  28840
    guerneville              9309
    boyes hot springs        8006
    petaluma                 6103
    el verano                5969
    geyserville              5822
    larkfield                5398
    forestville              5312
    dtype: int64

There are 5398 reports for that area ('larkfield'). But when I try to take a subset of the dataframe for that area, using

    larkfieldcomps = soco[soco['city'] == "larkfield"]

It returns 115 values, not 5398:

    In [94]: larkcounts = larkfieldcomps['year'].value_counts()
    In [95]: larkcounts
    Out[95]:
    2015    114
    2013      1
    dtype: int64

I thought maybe the problem was that in some entries there were one or more spaces before or after "larkfield" in the incident description, so I did a search/replace to try to strip out the spaces, but I still get 115 values when searching for "larkfield", though I know there are many more incidents in that area.

This is my first question on Stack Overflow ... I've researched this to death but haven't come up with an answer yet. Any suggestions are appreciated.

I can explain some of this after downloading the data (and reading it into a dataframe with read_csv using default settings). It appears that there are leading or trailing spaces in there. Apparently value_counts is smart enough to ignore them when adding things up, but boolean selection is more literal.

    >>> soco[soco['city'] == "larkfield"].city.count()
    122
    >>> soco['city2'] = soco.city.str.strip()
    >>> soco[soco['city2'] == "larkfield"].city.count()
    5520
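The same effect can be reproduced on a small frame. This is only a sketch: the toy data below is invented for illustration (it is not the sheriff's data), and only the column names city and city2 are taken from the session above.

```python
import pandas as pd

# Toy data, invented for illustration: two "larkfield" rows carry
# trailing spaces, one does not.
toy = pd.DataFrame(
    {"city": ["larkfield   ", "larkfield   ", "larkfield", "windsor"]}
)

# An exact comparison only matches the unpadded string.
exact_matches = int((toy["city"] == "larkfield").sum())

# Stripping whitespace into a new column first makes all three rows match.
toy["city2"] = toy["city"].str.strip()
stripped_matches = int((toy["city2"] == "larkfield").sum())

print(exact_matches, stripped_matches)
```

Writing the stripped values to a separate column, as above, keeps the raw values around for comparison; overwriting the original column would work just as well once the cause is understood.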

And when I look a little closer, it seems that 5398 have 11 trailing spaces and 122 have no spaces. That's the difference. (I'm not sure why you find 115 values by year instead of 122, but perhaps that's due to missing values in year, depending on how you created it.)
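If missing years are the reason for the 115 versus 122 gap (a guess, not something checked against the real data), it would show up like this: value_counts leaves NaN out by default, so its total can be smaller than the number of rows. The toy frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one row has no year recorded.
toy = pd.DataFrame(
    {
        "city": ["larkfield", "larkfield", "larkfield"],
        "year": [2015, 2015, np.nan],
    }
)

rows = len(toy)  # counts every row, including the one with a missing year

# value_counts drops NaN by default, so the total falls short of len(toy).
counted = int(toy["year"].value_counts().sum())

# dropna=False keeps NaN as its own entry, so the total matches again.
with_missing = int(toy["year"].value_counts(dropna=False).sum())

print(rows, counted, with_missing)
```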

But I did double check the behavior of value_counts, because I had been assuming that leading and trailing spaces would matter.

    >>> pd.Series([' foo', 'foo', 'foo ']).value_counts()
    foo     1
    foo     1
     foo    1

And, yeah, in that simple example leading and trailing blanks do indeed matter. But they don't in the 'soco' dataframe???

So there are still some loose ends here, but this is a start at figuring out what is happening.
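One way to chase that loose end is to make the whitespace visible. A possible explanation, offered as a guess rather than something verified against the real data, is that value_counts does treat the padded and unpadded strings as different values, and the 5398 row in the top ten is simply the padded one, with the unpadded rows listed separately further down. The sketch below uses an invented toy series to show how repr and str.len expose that kind of split.

```python
import pandas as pd

# Toy data, invented for illustration: same city name, with and without padding.
toy = pd.Series(["larkfield   ", "larkfield   ", "larkfield"])

# value_counts keeps the two spellings apart, but they look identical when printed.
counts = toy.value_counts()
n_distinct = len(counts)

# Mapping through repr wraps each value in quotes, so trailing spaces show up.
visible = toy.map(repr).value_counts()

# Counting string lengths is another quick check for hidden padding.
lengths = toy.str.len().value_counts()

print(n_distinct)
print(visible)
print(lengths)
```

Running the same two checks on soco['city'] would show whether 'larkfield' appears once or twice among the distinct values.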

