C# - Task Parallel is unstable, using 100% CPU at times
I'm testing out parallel in C#. It works fine, and using parallel is faster than normal foreach loops. However, at times (like 1 out of 5 times), my CPU will reach 100% usage, causing the parallel tasks to slow down. My CPU setup is an i5-4570 with 8GB RAM. Does anyone have an idea why this problem occurs?
Below is the code I used to test this out:
// Using normal foreach
ConcurrentBag<int> resultData = new ConcurrentBag<int>();
Stopwatch sw = new Stopwatch();
sw.Start();
foreach (var item in testData)
{
    if (item.Equals(1))
    {
        resultData.Add(item);
    }
}
Console.WriteLine("Normal Foreach " + sw.ElapsedMilliseconds);

// Using list parallel
resultData = new ConcurrentBag<int>();
sw.Restart();
System.Threading.Tasks.Parallel.For(0, testData.Count() - 1, (i, loopState) =>
{
    int data = testData[i];
    if (data.Equals(1))
    {
        resultData.Add(data);
    }
});
Console.WriteLine("List Parallel " + sw.ElapsedMilliseconds);

// Using list parallel foreach
//resultData.Clear();
resultData = new ConcurrentBag<int>();
sw.Restart();
System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
{
    if (item.Equals(1))
    {
        resultData.Add(item);
    }
});
Console.WriteLine("List Parallel Foreach " + sw.ElapsedMilliseconds);

// Using concurrent parallel
ConcurrentStack<int> resultData2 = new ConcurrentStack<int>();
sw.Restart();
System.Threading.Tasks.Parallel.For(0, testData.Count() - 1, (i, loopState) =>
{
    int data = testData[i];
    if (data.Equals(1))
    {
        resultData2.Push(data);
    }
});
Console.WriteLine("Concurrent Parallel " + sw.ElapsedMilliseconds);

// Using concurrent parallel foreach
resultData2.Clear();
sw.Restart();
System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
{
    if (item.Equals(1))
    {
        resultData2.Push(item);
    }
});
Console.WriteLine("Concurrent Parallel Foreach " + sw.ElapsedMilliseconds);
Normal output:
Normal Foreach 493
List Parallel 315
List Parallel Foreach 328
Concurrent Parallel 286
Concurrent Parallel Foreach 292
During 100% CPU usage:
Normal Foreach 476
List Parallel 8047
List Parallel Foreach 276
Concurrent Parallel 281
Concurrent Parallel Foreach 3960
(This can occur during any of the parallel tasks; the above is just one instance.)
UPDATE:
By using the PLINQ method provided by @Willaien and running it 100 times, the problem no longer occurs. I still have no idea why the issue surfaced in the first place, though.
var resultData3 = testData.AsParallel().Where(x => x == 1).ToList();
First of all, be careful with Parallel - it doesn't shield you from thread-safety issues. In your original code, you used non-thread-safe code when filling your list of results. In general, you want to avoid sharing any state (although the read-only access to the list is fine in this case). If you really want to use Parallel.For or Parallel.ForEach for filtering and aggregation (really, AsParallel is what you want in those cases), you should use the overload with thread-local state - you'd do the final results aggregation in the localFinally delegate (note that it's still run on a different thread, so you still need to ensure thread-safety; however, locking is fine in that case, since you're only doing it once per thread, rather than on every iteration).
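To illustrate, the thread-local overload looks roughly like this (a minimal sketch with made-up data, not your original code): each worker thread fills its own private list, and localFinally merges the per-thread lists under a lock, once per thread.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var testData = new List<int> { 1, 2, 3, 1, 5, 1 }; // stand-in for the real data
var results = new List<int>();
object resultsLock = new object();

Parallel.For(0, testData.Count,
    // localInit: one private list per worker thread
    () => new List<int>(),
    // body: touches no shared state
    (i, loopState, localList) =>
    {
        if (testData[i] == 1)
        {
            localList.Add(testData[i]);
        }
        return localList;
    },
    // localFinally: runs once per worker thread, so locking here is cheap
    localList =>
    {
        lock (resultsLock)
        {
            results.AddRange(localList);
        }
    });

Console.WriteLine(results.Count);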
Now, the obvious first thing to try in a problem like this is to use a profiler, so I've done that. The results are as follows:
- There are hardly any memory allocations in either of the solutions. They're entirely dwarfed by the initial test data allocation for relatively small test data (I used 1M, 10M and 100M integers when testing).
- The work is being done in e.g. the Parallel.For or Parallel.ForEach bodies themselves, not in your code (the simple if (data[i] == 1) results.Add(data[i])).
The first means we can say the GC isn't the culprit; indeed, it never gets a chance to run. The second is more curious - it means that in some cases, the overhead of Parallel is way out of line - and it's seemingly random: sometimes it works without a hitch, and sometimes it takes half a second. That would normally point to the GC, but we've already ruled that out.
I've also tried using the overload without loop state, but that didn't help. I've tried limiting MaxDegreeOfParallelism, but it only ever hurt things. Now, obviously, this code is absolutely dominated by cache access - there's hardly any CPU work and no I/O - which will always favour the single-threaded solution; but even using a MaxDegreeOfParallelism of 1 doesn't help - indeed, 2 seems to be the fastest on my system. More is useless - again, cache access dominates. It's still curious: I'm using a server CPU for the tests, which has plenty of cache for all of the data at once, and while we're not doing 100% sequential access (which pretty much gets rid of latency entirely), it should be quite sequential enough. Regardless, we have a baseline of memory throughput from the single-threaded solution, and it's quite close to the speed of the parallelised case when it works (parallelised, I'm seeing about 40% less runtime than single-threaded, on a four-core server CPU, for an embarrassingly parallel problem - again, obviously, memory access is the limit).
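For reference, limiting the degree of parallelism is done through ParallelOptions - something like this sketch (not my exact test code; the data here is a stand-in):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var testData = Enumerable.Repeat(1, 1_000_000).ToList(); // stand-in data
var results = new ConcurrentBag<int>();

// Cap the loop at two worker threads.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

Parallel.For(0, testData.Count, options, i =>
{
    if (testData[i] == 1)
    {
        results.Add(testData[i]);
    }
});

Console.WriteLine(results.Count);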
So, it's time to check the reference source for Parallel.For. In a case like this, it creates ranges based on the number of workers - one range each. It's not the ranges themselves - there's no real overhead in that. The core is a task that iterates over its given range. There are a few interesting bits - for example, the task will be "suspended" if it takes too long. However, that doesn't seem to fit the data - why would this cause random delays unrelated to the data size? No matter how small the work job is, and no matter how low MaxDegreeOfParallelism is, we get these "random" slowdowns. It might be the problem, but I have no idea how to check it.

The interesting thing is that expanding the test data does nothing to the anomaly - while it makes the "good" parallel runs faster (even getting close to perfect efficiency in my tests, oddly enough), the "bad" ones are still just as bad. In fact, in a few of my test runs, they were absurdly bad (up to ten times slower than the "normal" loop).
So, let's have a look at the threads. I've artificially bumped up the number of threads in the ThreadPool to ensure that expanding the thread pool isn't the bottleneck (it shouldn't be if everything worked well, but...). And here comes the first surprise - while the "good" runs use the 4-8 threads that make sense, the "bad" runs expand over all the available threads in the pool, even if there are a hundred of them. Oops?
Let's dive into the source code once again. Parallel internally uses Task.RunSynchronously to run the root of the partitioned work job, and Waits on the result. When I look at the parallel stacks, there are 97 threads executing the loop body, and only one has RunSynchronously on its stack (as expected - that's the main thread). The others are plain thread pool threads. The task IDs also tell a story - there are thousands of individual tasks being created while doing the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, this still happens, so it's not some closure weirdness either.

Explicitly setting MaxDegreeOfParallelism offsets this somewhat - the number of threads used doesn't explode anymore - however, the number of tasks still does. But we've already seen that the ranges match the number of parallel tasks running - so why does it keep creating more and more tasks? Using the debugger confirms it - with a MaxDOP of four, there are five ranges (there's some alignment that causes the fifth range). Interestingly, one of the completed ranges (how did the first one finish so far ahead of the rest?) has an index higher than the range it iterates - that's because the "scheduler" assigns range-partitions in slices of up to 16.

The root task is self-replicating, so instead of explicitly starting e.g. four tasks to handle the data, it waits for the scheduler to replicate the task to handle more data. It's kind of hard to read - we're talking about complex multi-threaded lock-less code - but it seems like it always assigns work in slices smaller than the partitioned ranges. In my testing, the maximal size of a slice was 16 - a far cry from the millions of items I'm iterating over. 16 iterations with a body like this take no time at all, which might lead to many issues with the algorithm (the biggest being the infrastructure taking more CPU work than the actual iterator body). In some cases, cache trashing might impact performance further (perhaps when there's a lot of variation in the body runtimes), but most of the time, the access is sequential enough.
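If you want to see the task churn for yourself, one rough way (a sketch, not the exact instrumentation I used) is to record every distinct Task.CurrentId observed inside the loop body - on the bad runs, the count should climb well past the handful of ranges you'd expect:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var testData = Enumerable.Repeat(1, 10_000_000).ToArray(); // stand-in data
var taskIds = new ConcurrentDictionary<int, bool>();

Parallel.For(0, testData.Length, i =>
{
    // Task.CurrentId identifies the task currently running this body.
    if (Task.CurrentId.HasValue)
    {
        taskIds.TryAdd(Task.CurrentId.Value, true);
    }
});

Console.WriteLine($"Distinct task ids observed: {taskIds.Count}");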
TL;DR:

Don't use Parallel.For and Parallel.ForEach if the work per iteration is very short (on the order of milliseconds). AsParallel, or simply running the iteration single-threaded, will most likely be faster.
Slightly longer explanation:

It seems that Parallel.For and Parallel.ForEach are designed for scenarios where the individual items you're iterating over take a substantial amount of time to execute (i.e. lots of work per item, not tiny amounts of work over a lot of items). They appear to perform badly when the iterator body is too short. If you're not doing substantial work in the iterator body, use AsParallel instead of Parallel.*. The threshold seems to be somewhere under 150ms per slice (around 10ms per iteration) - below that, Parallel.* will spend tons of time in its own code and hardly any time doing your iteration (in my case, the usual number was somewhere around 5-10% of the time spent in the body - embarrassingly bad).

Sadly, I didn't find any warning about this on MSDN - there are even samples going over substantial amounts of data, but there's no hint at the terrible performance hit of doing so. Testing the very same sample code on my computer, I've found that it's indeed often slower than single-threaded iteration, and at the best of times, barely faster (around 30-40% time savings while running on four CPU cores - not very efficient).
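To make the contrast concrete, here's a hypothetical counter-example (made-up workload, not from the original test) where each iteration burns a few milliseconds of CPU - with work this coarse, Parallel.ForEach does pay off, because the partitioning overhead becomes negligible:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var items = Enumerable.Range(0, 200).ToList();

var sw = Stopwatch.StartNew();
foreach (var item in items)
{
    Thread.SpinWait(5_000_000); // simulate a few milliseconds of pure CPU work
}
Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

sw.Restart();
Parallel.ForEach(items, item =>
{
    Thread.SpinWait(5_000_000);
});
Console.WriteLine($"Parallel:   {sw.ElapsedMilliseconds} ms");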
EDIT:

Willaien found a mention of this issue on MSDN, along with how to solve it - https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx. The idea is to use a custom partitioner and iterate over it in the Parallel.For body (e.g. a loop inside Parallel.For's loop), as shown in the sketch below. However, for most cases, using AsParallel is still the better choice - simple loop bodies usually mean some kind of map/reduce operation, and AsParallel and LINQ in general are great at that. For example, your sample code could be rewritten simply as:

var result = testData.AsParallel().Where(i => i == 1).ToList();

The only case where using AsParallel is a bad idea is the same as for all the other LINQ - when the loop body has side-effects. Some might be tolerable, but it's safer to avoid them altogether.