… this will be short and simple! I want to say a big Thank You to Papa Rage (PR). You can see his comment on the post about my bottleneck. He told me speed could be much improved by using non-blocking sockets and … this is how it went:
I complained about downloading the 100-page pack in 45-60 seconds. Then I took it to 15-30 seconds by dumping WinHTTP. Then I took it to 10 seconds by dumping blocking sockets and using non-blocking ones. But even though the packs mostly downloaded in 10-20 seconds, I realized that every now and then one took over 1 minute to download.
1 minute is totally unacceptable no matter how you look at it. So I set out to time my code with Performance Counters, at microsecond resolution. And all the downloading itself went smoothly.
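A minimal sketch of this kind of timing, in Python for illustration rather than the Windows code from this post. The helper name `time_stage_us` is made up for the example; `time.perf_counter_ns()` is Python's high-resolution counter.

```python
import time

def time_stage_us(func, *args, **kwargs):
    """Run func and return (result, elapsed time in microseconds)."""
    start = time.perf_counter_ns()
    result = func(*args, **kwargs)
    elapsed_us = (time.perf_counter_ns() - start) / 1000.0
    return result, elapsed_us

if __name__ == "__main__":
    # Time any stage separately (resolve, connect, download) to see which one blocks.
    total, took_us = time_stage_us(sum, range(1000))
    print(f"stage returned {total} after {took_us:.1f} microseconds")
```

Wrapping each stage on its own is what makes a slow resolve stand out from a fast download.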
After paying more attention I realized that all of the blocking was in the DNS resolving. gethostbyname, gethostbyaddr and getaddrinfo are as slow as it gets (almost dead) because they don't take a timeout. Some hosts took up to 15 seconds to fail resolving. Holy sheet! I then looked at all the results and noticed that all valid hosts resolve in less than one second. But there's a catch! You are limited to the cache on your local DNS server. So the first time a host needs to be resolved it may take a bit longer (less than 2.5 seconds; if it takes longer, it fails 99% of the time), because the server does not have it in cache and has to ask around.
I rewrote the DNS resolving using UDP to query the A records (IPs) and added a new parameter named timeout, in milliseconds. I also implemented my own local DNS caching system. Then I went further: every time I queue a pack of URLs, I send a request to my assigned DNS servers to resolve each domain without waiting for a reply. So I just ask the DNS server for the IP of the domain. I don't wait for it, but in the few milliseconds before the download actually starts, the DNS server will have cached the IP, so the second time I ask and actually wait for a reply … it already knows it. Speed greatly improved!
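To make the idea concrete, here is a rough sketch of an A-record query over UDP with a timeout in milliseconds. It is written in Python for illustration and is not the code from this post; the function names (`build_a_query`, `parse_a_response`, `resolve_a`) and the `dns_server` parameter are assumptions of the example, and it ignores retries, truncated replies and CNAME chasing.

```python
import random
import socket
import struct

def build_a_query(hostname, query_id=None):
    """Build a DNS query packet asking for the A records of hostname."""
    if query_id is None:
        query_id = random.randint(0, 0xFFFF)
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return query_id, header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def _skip_name(packet, offset):
    """Return the offset just past a (possibly compressed) name."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # a compression pointer is two bytes long
            return offset + 2
        offset += 1 + length

def parse_a_response(packet, query_id):
    """Return the IPv4 addresses found in the answer section."""
    rid, flags, qdcount, ancount, _, _ = struct.unpack("!HHHHHH", packet[:12])
    if rid != query_id or flags & 0x000F != 0:  # wrong id or error RCODE
        return []
    offset = 12
    for _ in range(qdcount):
        offset = _skip_name(packet, offset) + 4  # skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = _skip_name(packet, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10])
        offset += 10
        if rtype == 1 and rclass == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(packet[offset:offset + 4]))
        offset += rdlength
    return addresses

def resolve_a(hostname, dns_server, timeout_ms):
    """Query dns_server over UDP; return [] on timeout or any failure."""
    query_id, packet = build_a_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_ms / 1000.0)
        try:
            sock.sendto(packet, (dns_server, 53))
            reply, _ = sock.recvfrom(512)
            return parse_a_response(reply, query_id)
        except (OSError, struct.error, IndexError):  # timeout or malformed reply
            return []
```

Calling something like `resolve_a("example.com", "<your DNS server IP>", 1000)` gives up after one second instead of blocking for as long as the system resolver likes. The preemptive request is just the `sendto` half of `resolve_a` without the `recvfrom`.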
If a host fails to resolve in less than 1 second after the preemptive caching … I consider it too slow and disregard it. I get a failure rate of around 5% on 100-page packs, and the hosts that fail don't even load in browsers.
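The local cache, the preemptive request and the 1-second cutoff could fit together roughly like this. Again a Python sketch under assumptions, not the code from this post: the class name `DnsCache` and the two injected callables (`resolve`, which waits up to a timeout, and `prefetch`, which only sends the query) are hypothetical, and entries never expire here, which a real cache would need to handle.

```python
class DnsCache:
    """Local cache that warms the DNS server first, then resolves with a cutoff."""

    def __init__(self, resolve, prefetch, cutoff_ms=1000):
        self._resolve = resolve      # assumed: (host, timeout_ms) -> list of IPs
        self._prefetch = prefetch    # assumed: host -> None, sends a query, no wait
        self._cutoff_ms = cutoff_ms
        self._entries = {}           # host -> list of IPs ([] marks a failed host)

    def warm(self, hosts):
        """Fire one query per unknown host when a pack is queued; never wait."""
        for host in set(hosts):
            if host not in self._entries:
                self._prefetch(host)

    def lookup(self, host):
        """Return cached IPs, resolving at most once per host within the cutoff."""
        if host not in self._entries:
            self._entries[host] = self._resolve(host, self._cutoff_ms)
        return self._entries[host]
```

An empty list from `lookup` means the host was too slow or failed, so its URLs can be dropped from the pack without trying again.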
No matter how much you speed up the rest of your scraping, you still hit the DNS slowness barrier. And that barrier is one tough nut to crack! For speed you need to go low level. You need to go all the way! The DNS functions in Windows work on the same concept as the slow host-resolving ones, so using them is not an option and the DNS query has to be written from scratch. This will help you get rid of non-responsive hosts that take very long to time out, even if most hosts resolve extremely fast either way!
Thanks to PR's comment, the DNS caching and the low-level DNS querying, I now manage to download 100-page packs in 4-5 seconds with only 4 threads (one per CPU). It's lightning fast, and all 4 of my cores work at 50-80% during downloads :)! … But I fixed this with a very short Sleep();