Comparing two CSV files and exporting non-matches using Python -

- September 15, 2015

as beginner in using python i'm stuck yet again! i've got 2 csv files follows:

csv1 (master list)

id    name    code.    prop 12    sab     1234     zxc 12    sab     1236     zxc 12    sab     1233     zxc 12    sab     1234     zxc 11    asn     1234     abv 16    hgf     1233     aaa 19    mab     8765     bct 19    mab     8754     bct

csv2 (subset)

id    name    code.    prop 12    sab     1234     zxc 12    sab     1236     zxc 12    sab     1233     zxc 12    sab     1234     zxc 19    mab     8765     bct 19    mab     8754     bct

my goal try , use values in first column of csvs compare , identify not occur in csv2.

edit in above example rows id 11 , 16 csv1 (master list) should exported.

something consider. id although unique has multiple instances in both csv files (as demonstrated in sample data csv files above).

i have gone through few threads on website such this one. trying achieve exact opposite of asked here cannot understand solution on thread.

i have attempted results no avail. have attached code using below:

import csv  fopen1=open('c:\master.csv') fopen2=open('c:\subset.csv') foutput1=open('c:\untagged.csv', 'wb')   master=csv.reader(fopen1) subset=csv.reader(fopen2) untagged=csv.writer(foutput1)   count=0  subsetcopy=list()  header1=master.next() header2=subset.next()  untagged.writerow(header1)   row2 in subset:     subsetcopy.append(row2)   row1 in master:     row2 in subsetcopy:         if row1[0] != row2[0]:             count=count+1             untagged.writerow(row1)  print count

when run absurd results in order of millions (count). weird thing used exact same code without != (used == instead) achieve goal , worked charm. thought changing equality condition should give me opposite result. instead ends producing huge csv file nothiing useful. tried use dictionary realised may not work because of duplication of records in both files. important me instances of particular row in both files.

where going wrong? advice/suggestions welcome.

what doing wrong in loop -

for row1 in master:     row2 in subsetcopy:         if row1[0] != row2[0]:             count=count+1             untagged.writerow(row1)

for each row1 in master compare id against row2 (using id ) , if not equal write row1 untagged. cause row1 written untagged many times , many times there unrelated rows in subsetcopy , , not how check whether row1's id exists in subset.

you need first go through each row in subsetcopy , save in set, , compare checking against set. example -

import csv  fopen1=open('c:\master.csv') fopen2=open('c:\subset.csv') foutput1=open('c:\untagged.csv', 'wb')  master=csv.reader(fopen1) subset=csv.reader(fopen2) untagged=csv.writer(foutput1)  count=0  subsetcopy=set() header1=master.next() header2=subset.next() untagged.writerow(header1)  row2 in subset:     subsetcopy.add(row2[0])  row1 in master:     if row1[0] not in subsetcopy:             count=count+1             untagged.writerow(row1) print count

Search This Blog

Core code