I had a problem where I needed to clean some stuff from files and then upload them to the server. The cleaning part was fast: for more than a hundred files (2 GB in total) it took less than a minute. The problem was the upload. It took several hours.
So, I thought, let's try doing it in parallel. I'll launch the upload on as many threads as I have cores; my workstation has a Xeon, so that should help. The upload doesn't use much bandwidth, so I didn't need to worry about clogging my network connection. It'll be fine. It wasn't. Luckily.
It turned out our server couldn't handle that many simultaneous connections. Let's limit the number of connections, was my next thought. And then it struck me: why upload the whole file again when the actual cleanup changes only a small percentage of lines, less than 10% in most cases? Not to mention that more than half of the files don't require cleanup at all.
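For what it's worth, capping the number of simultaneous connections is straightforward with a bounded thread pool. This is only a sketch of the approach I almost took, not my actual code; `upload`, `MAX_CONNECTIONS`, and the file list are all hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on simultaneous connections the server can tolerate.
MAX_CONNECTIONS = 4

def upload(path):
    # Placeholder for the real upload call; here it just returns the path.
    return path

files = [f"file_{i}.txt" for i in range(10)]  # hypothetical file list

# The pool never runs more than MAX_CONNECTIONS uploads at once.
with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
    results = list(pool.map(upload, files))
```

The pool queues the remaining files internally, so the server only ever sees up to `MAX_CONNECTIONS` concurrent uploads.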
I changed my cleanup code so it would output only the lines that were actually cleaned. The total size went down to 8 MB! And as you can imagine, the upload isn't an issue anymore, even on a single thread. The entire process takes less than 5 minutes now!
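The core of that change can be sketched as follows. This is a minimal illustration, not my actual code: `clean_line` and its replacement rule are hypothetical, and I'm assuming the output pairs each cleaned line with its line number so the server side can patch the original file:

```python
def clean_line(line):
    # Hypothetical cleanup rule: redact a sensitive marker.
    return line.replace("SECRET", "[removed]")

def changed_lines(lines):
    """Yield (line_number, cleaned_line) only for lines the cleanup modified."""
    for number, line in enumerate(lines, start=1):
        cleaned = clean_line(line)
        if cleaned != line:
            yield number, cleaned

sample = ["ok line\n", "has SECRET here\n", "another ok line\n"]
patch = list(changed_lines(sample))
# Only the one modified line is emitted, keyed by its line number.
```

Since untouched lines are never emitted, files that need no cleanup produce an empty patch, which is exactly why the upload shrank from gigabytes to megabytes.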
If you don't understand the problem well, don't go parallel. It will make things worse, or at least hide the real problem until you have more input data.
Another lesson I took from this experience is the value of code reviews. Although I can't rely on them since I'm the sole engineer, I fully understand how important they are. I have no doubt this problem would have been caught much earlier if I'd had a second pair of eyes checking the code.