C# - Task Parallel is unstable, using 100% CPU at times
I'm testing out parallel in C#. It works fine, and using parallel is faster than normal foreach loops. However, at times (like 1 out of 5 times), my CPU will reach 100% usage, causing the parallel tasks to slow down. My CPU setup is an i5-4570 with 8GB RAM. Does anyone have an idea why this problem occurs?
Below is the code I used to test this out:
// Using normal foreach
ConcurrentBag<int> resultData = new ConcurrentBag<int>();
Stopwatch sw = new Stopwatch();
sw.Start();
foreach (var item in testData)
{
    if (item.Equals(1))
    {
        resultData.Add(item);
    }
}
Console.WriteLine("Normal Foreach " + sw.ElapsedMilliseconds);

// Using list parallel
resultData = new ConcurrentBag<int>();
sw.Restart();
System.Threading.Tasks.Parallel.For(0, testData.Count() - 1, (i, loopState) =>
{
    int data = testData[i];
    if (data.Equals(1))
    {
        resultData.Add(data);
    }
});
Console.WriteLine("List Parallel " + sw.ElapsedMilliseconds);

// Using list parallel foreach
//resultData.Clear();
resultData = new ConcurrentBag<int>();
sw.Restart();
System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
{
    if (item.Equals(1))
    {
        resultData.Add(item);
    }
});
Console.WriteLine("List Parallel Foreach " + sw.ElapsedMilliseconds);

// Using concurrent parallel
ConcurrentStack<int> resultData2 = new ConcurrentStack<int>();
sw.Restart();
System.Threading.Tasks.Parallel.For(0, testData.Count() - 1, (i, loopState) =>
{
    int data = testData[i];
    if (data.Equals(1))
    {
        resultData2.Push(data);
    }
});
Console.WriteLine("Concurrent Parallel " + sw.ElapsedMilliseconds);

// Using concurrent parallel foreach
resultData2.Clear();
sw.Restart();
System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
{
    if (item.Equals(1))
    {
        resultData2.Push(item);
    }
});
Console.WriteLine("Concurrent Parallel Foreach " + sw.ElapsedMilliseconds);
Normal output:
Normal Foreach 493
List Parallel 315
List Parallel Foreach 328
Concurrent Parallel 286
Concurrent Parallel Foreach 292
During 100% CPU usage:
Normal Foreach 476
List Parallel 8047
List Parallel Foreach 276
Concurrent Parallel 281
Concurrent Parallel Foreach 3960
(This can occur during any of the parallel tasks; the above is just one instance.)
UPDATE:
By using the PLINQ method provided by @Willaien and running it 100 times, the problem no longer occurs. I still have no idea why the issue surfaced in the first place, though.
var resultData3 = testData.AsParallel().Where(x => x == 1).ToList();
First of all, be careful with Parallel - it doesn't shield you from thread-safety issues. In your original code, you used non-thread-safe code when filling your list of results. In general, you want to avoid sharing any state (although the read-only access to the list is fine in this case). If you really want to use Parallel.For or Parallel.ForEach for filtering and aggregation (really, AsParallel is what you want in those cases), you should use the overload with thread-local state - you'd do the final results aggregation in the localFinally delegate (note that it's still run on a different thread, so you still need to ensure thread-safety; however, locking is fine in that case, since you're only doing it once per thread, rather than on every iteration).
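To illustrate, the thread-local overload looks roughly like this (a minimal sketch with made-up data, not your original code): each worker thread fills its own private list, and localFinally merges the per-thread lists under a lock, once per thread.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var testData = new List<int> { 1, 2, 3, 1, 5, 1 }; // stand-in for the real data
var results = new List<int>();
object resultsLock = new object();

Parallel.For(0, testData.Count,
    // localInit: one private list per worker thread
    () => new List<int>(),
    // body: touches no shared state
    (i, loopState, localList) =>
    {
        if (testData[i] == 1)
        {
            localList.Add(testData[i]);
        }
        return localList;
    },
    // localFinally: runs once per worker thread, so locking here is cheap
    localList =>
    {
        lock (resultsLock)
        {
            results.AddRange(localList);
        }
    });

Console.WriteLine(results.Count);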
Now, the obvious first thing to try in a problem like this is to use a profiler, so I've done that. The results are as follows:
- There are hardly any memory allocations in either of the solutions. They're entirely dwarfed by the initial test data allocation for relatively small test data (I used 1M, 10M and 100M integers when testing).
- The work is being done in e.g. the Parallel.For or Parallel.ForEach bodies themselves, not in your code (the simple if (data[i] == 1) results.Add(data[i])).
The first means we can say the GC isn't the culprit; indeed, it never gets a chance to run. The second is more curious - it means that in some cases, the overhead of Parallel is way out of line - and it's seemingly random: sometimes it works without a hitch, and sometimes it takes half a second. That would normally point to the GC, but we've already ruled that out.
I've also tried using the overload without loop state, but that didn't help. I've tried limiting MaxDegreeOfParallelism, but it only ever hurt things. Now, obviously, this code is absolutely dominated by cache access - there's hardly any CPU work and no I/O - which will always favour the single-threaded solution; but even using a MaxDegreeOfParallelism of 1 doesn't help - indeed, 2 seems to be the fastest on my system. More is useless - again, cache access dominates. It's still curious: I'm using a server CPU for the tests, which has plenty of cache for all of the data at once, and while we're not doing 100% sequential access (which pretty much gets rid of latency entirely), it should be quite sequential enough. Regardless, we have a baseline of memory throughput from the single-threaded solution, and it's quite close to the speed of the parallelised case when it works (parallelised, I'm seeing about 40% less runtime than single-threaded, on a four-core server CPU, for an embarrassingly parallel problem - again, obviously, memory access is the limit).
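For reference, limiting the degree of parallelism is done through ParallelOptions - something like this sketch (not my exact test code; the data here is a stand-in):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var testData = Enumerable.Repeat(1, 1_000_000).ToList(); // stand-in data
var results = new ConcurrentBag<int>();

// Cap the loop at two worker threads.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

Parallel.For(0, testData.Count, options, i =>
{
    if (testData[i] == 1)
    {
        results.Add(testData[i]);
    }
});

Console.WriteLine(results.Count);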
So, it's time to check the reference source for Parallel.For. In a case like this, it creates ranges based on the number of workers - one range each. It's not the ranges themselves - there's no real overhead in that. The core is a task that iterates over its given range. There are a few interesting bits - for example, the task will be "suspended" if it takes too long. However, that doesn't seem to fit the data - why would this cause random delays unrelated to the data size? No matter how small the work job is, and no matter how low MaxDegreeOfParallelism is, we get these "random" slowdowns. It might be the problem, but I have no idea how to check it.

The interesting thing is that expanding the test data does nothing to the anomaly - while it makes the "good" parallel runs faster (even getting close to perfect efficiency in my tests, oddly enough), the "bad" ones are still just as bad. In fact, in a few of my test runs, they were absurdly bad (up to ten times slower than the "normal" loop).
So, let's have a look at the threads. I've artificially bumped up the number of threads in the ThreadPool to ensure that expanding the thread pool isn't the bottleneck (it shouldn't be if everything worked well, but...). And here comes the first surprise - while the "good" runs use the 4-8 threads that make sense, the "bad" runs expand over all the available threads in the pool, even if there are a hundred of them. Oops?
Let's dive into the source code once again. Parallel internally uses Task.RunSynchronously to run the root of the partitioned work job, and Waits on the result. When I look at the parallel stacks, there are 97 threads executing the loop body, and only one has RunSynchronously on its stack (as expected - that's the main thread). The others are plain thread pool threads. The task IDs also tell a story - there are thousands of individual tasks being created while doing the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, this still happens, so it's not some closure weirdness either.

Explicitly setting MaxDegreeOfParallelism offsets this somewhat - the number of threads used doesn't explode anymore - however, the number of tasks still does. But we've already seen that the ranges match the number of parallel tasks running - so why does it keep creating more and more tasks? Using the debugger confirms it - with a MaxDOP of four, there are five ranges (there's some alignment that causes the fifth range). Interestingly, one of the completed ranges (how did the first one finish so far ahead of the rest?) has an index higher than the range it iterates - that's because the "scheduler" assigns range-partitions in slices of up to 16.

The root task is self-replicating, so instead of explicitly starting e.g. four tasks to handle the data, it waits for the scheduler to replicate the task to handle more data. It's kind of hard to read - we're talking about complex multi-threaded lock-less code - but it seems like it always assigns work in slices smaller than the partitioned ranges. In my testing, the maximal size of a slice was 16 - a far cry from the millions of items I'm iterating over. 16 iterations with a body like this take no time at all, which might lead to many issues with the algorithm (the biggest being the infrastructure taking more CPU work than the actual iterator body). In some cases, cache trashing might impact performance further (perhaps when there's a lot of variation in the body runtimes), but most of the time, the access is sequential enough.
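If you want to see the task churn for yourself, one rough way (a sketch, not the exact instrumentation I used) is to record every distinct Task.CurrentId observed inside the loop body - on the bad runs, the count should climb well past the handful of ranges you'd expect:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var testData = Enumerable.Repeat(1, 10_000_000).ToArray(); // stand-in data
var taskIds = new ConcurrentDictionary<int, bool>();

Parallel.For(0, testData.Length, i =>
{
    // Task.CurrentId identifies the task currently running this body.
    if (Task.CurrentId.HasValue)
    {
        taskIds.TryAdd(Task.CurrentId.Value, true);
    }
});

Console.WriteLine($"Distinct task ids observed: {taskIds.Count}");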
TL;DR:

Don't use Parallel.For and Parallel.ForEach if the work per iteration is very short (on the order of milliseconds). AsParallel, or simply running the iteration single-threaded, will most likely be faster.
Slightly longer explanation:

It seems that Parallel.For and Parallel.ForEach are designed for scenarios where the individual items you're iterating over take a substantial amount of time to execute (i.e. lots of work per item, not tiny amounts of work over a lot of items). They appear to perform badly when the iterator body is too short. If you're not doing substantial work in the iterator body, use AsParallel instead of Parallel.*. The threshold seems to be somewhere under 150ms per slice (around 10ms per iteration) - below that, Parallel.* will spend tons of time in its own code and hardly any time doing your iteration (in my case, the usual number was somewhere around 5-10% of the time spent in the body - embarrassingly bad).

Sadly, I didn't find any warning about this on MSDN - there are even samples going over substantial amounts of data, but there's no hint at the terrible performance hit of doing so. Testing the very same sample code on my computer, I've found that it's indeed often slower than single-threaded iteration, and at the best of times, barely faster (around 30-40% time savings while running on four CPU cores - not very efficient).
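To make the contrast concrete, here's a hypothetical counter-example (made-up workload, not from the original test) where each iteration burns a few milliseconds of CPU - with work this coarse, Parallel.ForEach does pay off, because the partitioning overhead becomes negligible:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var items = Enumerable.Range(0, 200).ToList();

var sw = Stopwatch.StartNew();
foreach (var item in items)
{
    Thread.SpinWait(5_000_000); // simulate a few milliseconds of pure CPU work
}
Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

sw.Restart();
Parallel.ForEach(items, item =>
{
    Thread.SpinWait(5_000_000);
});
Console.WriteLine($"Parallel:   {sw.ElapsedMilliseconds} ms");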
EDIT:

Willaien found a mention of this issue on MSDN, along with how to solve it - https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx. The idea is to use a custom partitioner and iterate over it in the Parallel.For body (e.g. a loop inside Parallel.For's loop), as shown in the sketch below. However, for most cases, using AsParallel is still the better choice - simple loop bodies usually mean some kind of map/reduce operation, and AsParallel and LINQ in general are great at that. For example, your sample code could be rewritten simply as:

var result = testData.AsParallel().Where(i => i == 1).ToList();

The only case where using AsParallel is a bad idea is the same as for all the other LINQ - when the loop body has side-effects. Some might be tolerable, but it's safer to avoid them altogether.