
PR (Papa Rage) - You're The Man | But I Took This To Another Level ;) : 5ubliminal's TellinYa

I said I would stop posting for a short while, but …

… this will be short and simple! I want to express a big Thank You to Papa Rage (PR). You can see his comment on the post about my bottleneck. He told me speed could be much improved by using non-blocking sockets and … this is how it went:

Non Blocking Sockets!

I complained about downloading the 100-page pack in 45-60 seconds. Then I took it to 15-30 seconds by dumping WinHTTP. Then I took it to 10 seconds by dumping blocking sockets and using non-blocking ones. But even though the packs mostly downloaded in 10-20 seconds, I realized that, every now and then, they took over 1 minute to download.
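To make the non-blocking idea concrete, here is a minimal sketch in Python (my own code is not Python, so treat this purely as an illustration). The function name fetch_all and the (ip, port, request) tuple layout are assumptions made up for this example. It takes IPs, not hostnames, so no DNS lookup is involved here.

```python
import selectors
import socket
import time


def fetch_all(targets, timeout=10.0):
    """Fetch from several (ip, port, request_bytes) targets on one thread.

    Returns {index: bytes_received}. Targets still pending at the deadline
    keep whatever was read so far.
    """
    sel = selectors.DefaultSelector()
    results = {}
    for i, (ip, port, request) in enumerate(targets):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        s.connect_ex((ip, port))  # returns at once; the connect finishes later
        results[i] = b""
        # A socket becomes writable when its connect has finished (or failed).
        sel.register(s, selectors.EVENT_WRITE, {"i": i, "out": request})

    deadline = time.monotonic() + timeout
    while sel.get_map() and time.monotonic() < deadline:
        for key, events in sel.select(timeout=0.1):
            s, state = key.fileobj, key.data
            try:
                if events & selectors.EVENT_WRITE:
                    sent = s.send(state["out"])
                    state["out"] = state["out"][sent:]
                    if not state["out"]:  # request fully sent, now wait for data
                        sel.modify(s, selectors.EVENT_READ, state)
                elif events & selectors.EVENT_READ:
                    chunk = s.recv(65536)
                    if chunk:
                        results[state["i"]] += chunk
                    else:  # peer closed the connection: this download is done
                        sel.unregister(s)
                        s.close()
            except BlockingIOError:
                continue  # not really ready yet; try again on the next round
            except OSError:  # refused, reset, unreachable: give up on this one
                sel.unregister(s)
                s.close()

    # Drop anything that did not finish before the deadline.
    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return results
```

No single slow server holds up the others, because the loop only touches sockets that are ready. It relies on the server closing the connection when it is done (HTTP/1.0 style), which keeps the sketch short.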

1 minute is totally unacceptable no matter how you look at it. So I set out to time my code with Performance Counters, down to the microsecond. And all the downloading went smoothly.
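The timing itself is simple. A rough Python equivalent of what I did with the Windows Performance Counters could look like the sketch below. The helper name time_stages is hypothetical, and time.perf_counter_ns stands in for the Windows counter API.

```python
import time


def time_stages(stages):
    """Run each (name, callable) stage in order.

    Returns {name: elapsed_microseconds} so the slow stage stands out.
    """
    timings = {}
    for name, fn in stages:
        start = time.perf_counter_ns()  # high-resolution monotonic clock
        fn()
        timings[name] = (time.perf_counter_ns() - start) / 1000.0
    return timings
```

Wrap the resolve, connect and download steps as separate stages and the numbers show right away which one is eating the time.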

Where the hell was it blocking!

After paying more attention I realized that all the blocking was in the DNS resolving. gethostbyname, gethostbyaddr and getaddrinfo are as slow as it gets (almost dead) as they don't take a timeout. Some hosts took up to 15 seconds to fail resolving. Holy sheet! I then looked at all the results and noticed that all valid hosts resolve in less than one second. But there's a catch! You are limited to the cache on your local DNS server. So the first time a host needs resolving may take a bit more (less than 2.5 seconds; if it takes longer, it fails 99% of the time), because the server does not have it in cache and also needs to ask around.

How did I fix all this?

I rewrote the DNS resolving using UDP to query the A records (IPs) and added a new parameter named timeout, in milliseconds. I also implemented my own DNS caching system locally. Then I went further: every time I queue a pack of URLs I send a request to my assigned DNS servers to resolve the hosts, without waiting for a reply. So I just ask the DNS server for the IP of the domain. I don't wait for it, but in the few milliseconds before the download actually starts, the DNS server will have cached the IP, so the second time I ask and actually wait for a reply … it already knows it. Speed greatly improved!
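For the curious, a bare-bones version of such a UDP resolver could look like the Python sketch below. It is not my actual code. The function names (build_query, parse_a_records, resolve_a) and the wait flag are made up for this example, and it ignores truncated replies, retries and IPv6. The packet layout follows the standard DNS message format.

```python
import secrets
import socket
import struct


def build_query(hostname, query_id):
    """Build a DNS query packet asking for the A records of hostname."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(p)]) + p for p in labels) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


def _skip_name(msg, pos):
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if length & 0xC0 == 0xC0:  # compression pointer, always 2 bytes
            return pos + 2
        pos += 1 + length


def parse_a_records(msg, query_id):
    """Return the IPv4 addresses in a reply, or [] on mismatch or error."""
    rid, flags, qdcount, ancount, _, _ = struct.unpack(">HHHHHH", msg[:12])
    if rid != query_id or flags & 0x000F:  # not our reply, or RCODE != 0
        return []
    pos = 12
    for _ in range(qdcount):
        pos = _skip_name(msg, pos) + 4  # skip QTYPE and QCLASS
    ips = []
    for _ in range(ancount):
        pos = _skip_name(msg, pos)
        rtype, rclass, _ttl, rdlen = struct.unpack(">HHIH", msg[pos:pos + 10])
        pos += 10
        if rtype == 1 and rclass == 1 and rdlen == 4:
            ips.append(socket.inet_ntoa(msg[pos:pos + 4]))
        pos += rdlen
    return ips


def resolve_a(hostname, server, timeout_ms, wait=True, port=53):
    """Ask server for hostname's A records; give up after timeout_ms.

    With wait=False the query is sent and forgotten, which only warms
    the DNS server's cache.
    """
    query_id = secrets.randbelow(65536)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout_ms / 1000.0)
        s.sendto(build_query(hostname, query_id), (server, port))
        if not wait:
            return []
        try:
            reply, _ = s.recvfrom(512)
        except OSError:  # timeout or network error: treat as unresolved
            return []
    return parse_a_records(reply, query_id)
```

The whole point is the timeout_ms parameter: a dead host costs you exactly as long as you decide, not as long as the system resolver feels like.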

If a host fails to resolve in less than 1 second after the preemptive caching … I consider it too slow and disregard it. I have a failure rate of around 5% on 100-packs, and those hosts don't even load in browsers.
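The caching, the fire-and-forget warm-up and the 1 second cutoff fit together roughly as in the Python sketch below. The class name DnsCache, the two injected callables, the 300 second lifetime and the choice to cache failures are all assumptions for illustration, not a description of my real code.

```python
import time


class DnsCache:
    """Local DNS cache with a warm-up step and a hard lookup timeout."""

    def __init__(self, resolve, prefetch, ttl_seconds=300.0):
        # resolve(host, timeout_ms) -> list of IPs, empty if it failed.
        # prefetch(host) -> sends the query and returns without waiting.
        self._resolve = resolve
        self._prefetch = prefetch
        self._ttl = ttl_seconds      # assumed lifetime of a cache entry
        self._entries = {}           # host -> (expires_at, ips)

    def _fresh(self, host):
        """Return the cached IP list, or None if missing or expired."""
        entry = self._entries.get(host)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None

    def queue(self, hosts):
        """Warm the DNS server's cache for every host we do not know yet."""
        for host in hosts:
            if self._fresh(host) is None:
                self._prefetch(host)

    def lookup(self, host, timeout_ms=1000):
        """Return the host's IPs; an empty list means disregard the host."""
        ips = self._fresh(host)
        if ips is None:
            ips = self._resolve(host, timeout_ms)
            # Assumption: failures are cached too, so a dead host is not
            # retried for every URL in the pack.
            self._entries[host] = (time.monotonic() + self._ttl, ips)
        return ips
```

Call queue() when the pack of URLs is queued and lookup() right before each download starts. By then the DNS server has usually had time to fetch the answer.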

What do we learn here?

No matter how much you speed up scraping, you still hit the DNS slowness barrier. And that barrier is one tough nut to crack! For speed you need to go low level. You need to go all the way! The DNS functions in Windows work on the same concept as the slow ones for resolving hosts. So using the Windows DNS functions is not an option and the DNS query has to be rewritten from scratch. This will help you get rid of non-responsive hosts that time out very slowly, even if most hosts resolve extremely fast either way!

Thanks to PR's comment, the DNS caching and the low-level DNS querying, I manage to download 100-packs in 4-5 seconds with only 4 threads (one per CPU). It's lightning fast and all 4 of my cores work at 50-80% during downloads :)! … But I fixed this with a very short Sleep();

5 Comments Posted By Readers :

#1 emonk from United States
Posted on Friday, 08 February, 2008
Depending on what you're doing, you might also want to look into HTTP pipelining. It REALLY made a huge difference for me.
#2 Papa Rage from United States
Posted on Friday, 08 February, 2008
That's great to hear, kudos on implementing it quickly. Slow DNS resolution can be a real pain, especially when it's handled in someone else's code, and the handling is not to your liking.
#3 5ubliminal web
Posted on Friday, 08 February, 2008
@emonk: I have not used this before, but it seems the requests sent have to be for the same server.
I'm actually scraping SERPs, so I'll look into this rather soon, when I implement an all-purpose full-site scraper spider.
#4 5ubliminal web
Posted on Friday, 08 February, 2008
@Papa Rage:
10x (=tenx=thanks). It indeed did miracles.
PS: Good to know PR doesn't come from Page Rank:)
#5 Papa Rage from United States
Posted on Friday, 08 February, 2008
@emonk

Pipelining is another fantastic way to improve speed. When implementing your own non-blocking HTTP client, it is very expensive to implement compared to the rest of the client. Expensive in terms of memory: all of a sudden your need for write buffers jumps from one per thread to one per connection (worst case), because the socket may not be ready to accept all your data in one shot.
Expensive in terms of complexity: you can pipeline your requests before you get your first response, but the host is allowed to close the connection at any time and you have to be prepared to retry everything in your pipeline.

However this is one of those expensive things that's well worth the investment.
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .
Rendered On : [Friday, 12 March, 2010 - 06:06:23 GMT]   No Ajax / Flash Used Here
Tellinya.com is relocating to blog.5ubliminal.com. This blog is no longer maintained and comments are no longer accepted / answered.