Comparing two CSV files and exporting non-matches using Python -
as beginner in using python i'm stuck yet again! i've got 2 csv files follows:
csv1 (master list)
id name code. prop 12 sab 1234 zxc 12 sab 1236 zxc 12 sab 1233 zxc 12 sab 1234 zxc 11 asn 1234 abv 16 hgf 1233 aaa 19 mab 8765 bct 19 mab 8754 bct
csv2 (subset)
id name code. prop 12 sab 1234 zxc 12 sab 1236 zxc 12 sab 1233 zxc 12 sab 1234 zxc 19 mab 8765 bct 19 mab 8754 bct
my goal try , use values in first column of csvs compare , identify not occur in csv2.
edit in above example rows id 11 , 16 csv1 (master list) should exported.
something consider. id although unique has multiple instances in both csv files (as demonstrated in sample data csv files above).
i have gone through few threads on website such this one. trying achieve exact opposite of asked here cannot understand solution on thread.
i have attempted results no avail. have attached code using below:
import csv fopen1=open('c:\master.csv') fopen2=open('c:\subset.csv') foutput1=open('c:\untagged.csv', 'wb') master=csv.reader(fopen1) subset=csv.reader(fopen2) untagged=csv.writer(foutput1) count=0 subsetcopy=list() header1=master.next() header2=subset.next() untagged.writerow(header1) row2 in subset: subsetcopy.append(row2) row1 in master: row2 in subsetcopy: if row1[0] != row2[0]: count=count+1 untagged.writerow(row1) print count
when run absurd results in order of millions (count). weird thing used exact same code without != (used == instead) achieve goal , worked charm. thought changing equality condition should give me opposite result. instead ends producing huge csv file nothiing useful. tried use dictionary realised may not work because of duplication of records in both files. important me instances of particular row in both files.
where going wrong? advice/suggestions welcome.
what doing wrong in loop -
for row1 in master: row2 in subsetcopy: if row1[0] != row2[0]: count=count+1 untagged.writerow(row1)
for each row1 in master
compare id against row2
(using id
) , if not equal write row1 untagged. cause row1
written untagged
many times , many times there unrelated rows in subsetcopy
, , not how check whether row1
's id exists in subset.
you need first go through each row in subsetcopy
, save in set, , compare checking against set. example -
import csv fopen1=open('c:\master.csv') fopen2=open('c:\subset.csv') foutput1=open('c:\untagged.csv', 'wb') master=csv.reader(fopen1) subset=csv.reader(fopen2) untagged=csv.writer(foutput1) count=0 subsetcopy=set() header1=master.next() header2=subset.next() untagged.writerow(header1) row2 in subset: subsetcopy.add(row2[0]) row1 in master: if row1[0] not in subsetcopy: count=count+1 untagged.writerow(row1) print count
Comments
Post a Comment