Hit The Bottleneck | How I Improved Speed : 5ubliminal's TellinYa

Most of you will learn almost nothing from this post (so reading is optional :)) but you'll see the things I struggle with, on a daily basis ;) And later on I'll have a cool piece of software for you … not this one :)
I hit the speed bottleneck!

I worked and worked and finally realized my code was not fast enough. It's serious scraping and text-gathering software, built using WinHTTP and GRETA RegExp. All cool, but I realized that I could not get the download time for a 100-page pack below 45-60 seconds, with an average of 50 seconds, using 25 threads and a decent internet connection.

WinHTTP is excellent from one point of view. It has all the bells and whistles Microsoft adds to their libraries, and it has one big plus: it can handle HTTPS which, to implement from scratch, even using the CryptoAPI, would be wasted time I can't afford right now.

After pondering for quite a while I realized I had to go through with a class I had half built. I had to write an HTTP client class from scratch (sockets and the HTTP protocol). I did some tests and noticed a speed increase of 10-30% if I ditched WinHTTP in favor of my own implementation.

I sat back and wondered: is an average 15% increase in performance worth a complete rewrite of the HTTP client, which was a 2-3 hour job? I decided to do it, and I had an extra reason for this:

You want HTTP downloading speed: GZip and Deflate!

These are the magic words. By adding Accept-Encoding: gzip, deflate to HTTP requests you usually receive (from compatible web servers) less than 50% of the data you get when fetching plain text. Most times it shrinks by over 75%, so download times also fall dramatically. But there is one small problem. WinHTTP does not swallow GZip or Deflate and, using either WinHTTP or my own HTTP client, I reached the same conclusion: the decompression had to be added by me!
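To make the header business concrete, here is a minimal sketch of the two HTTP-side pieces: a request that advertises compression support, and a check of the response headers to see whether the body needs decompressing. The names BuildGetRequest, GetHeader and DetectEncoding are made up for illustration and are not the functions from my client; the actual inflate step (zlib or your own code) is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Build a minimal HTTP/1.1 GET request that advertises gzip/deflate support.
std::string BuildGetRequest(const std::string& host, const std::string& path) {
    assert(!host.empty() && !path.empty());
    return "GET " + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Accept-Encoding: gzip, deflate\r\n"
           "Connection: close\r\n"
           "\r\n";
}

// Lower-case a copy of the string (header names are case-insensitive).
static std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Return the lower-cased, trimmed value of a response header, or "" if absent.
// rawHeaders is the full header block, starting with the status line.
std::string GetHeader(const std::string& rawHeaders, const std::string& name) {
    const std::string lowered = ToLower(rawHeaders);
    const std::string key = "\r\n" + ToLower(name) + ":";
    std::size_t pos = lowered.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    std::size_t end = lowered.find("\r\n", pos);
    if (end == std::string::npos) end = lowered.size();
    const std::string value = lowered.substr(pos, end - pos);
    const std::size_t first = value.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::size_t last = value.find_last_not_of(" \t");
    return value.substr(first, last - first + 1);
}

enum class BodyEncoding { Identity, Gzip, Deflate };

// Decide which decompressor (if any) the response body needs.
BodyEncoding DetectEncoding(const std::string& rawHeaders) {
    const std::string value = GetHeader(rawHeaders, "Content-Encoding");
    if (value == "gzip" || value == "x-gzip") return BodyEncoding::Gzip;
    if (value == "deflate") return BodyEncoding::Deflate;
    return BodyEncoding::Identity;
}
```

A server is free to ignore the request header and send plain text, so the client has to look at Content-Encoding on every response instead of assuming the body is compressed.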

Not to brag, but to me compressing and decompressing come down to two pairs of functions: Compress / Uncompress and GZEncode / GZDecode, which I had already written over a year ago. They just needed to blend with the HTTP client of choice. About 3 hours later I had a functional HTTP client which reduced download time from an average of 50 seconds to an average of 15 seconds. And this is where it ended.

I did not use cURL as I hate code clutter. I never use BOOST, cURL or other such large, complex libraries. I write everything myself and keep things small :) A beast that could smoke the average CPU is smaller than 400KB in size.
Really happy with results: Next Step!

Regular expressions are sent from above but there's one rule to abide by: when you have long data and easy rules … do the job by hand. By re-implementing by hand some of the regular expressions I used in the code, I cut RegExp processing time in half.
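As an illustration only (the patterns I actually replaced are not shown here), take a rule as easy as "grab whatever sits between two fixed markers", which as a regular expression would be something like <title>(.*?)</title>. A hand-written version needs nothing but strstr and strlen. ExtractBetween is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Return the text between the first occurrence of `open` and the next
// occurrence of `close`, or an empty string if either marker is missing.
std::string ExtractBetween(const char* text, const char* open, const char* close) {
    assert(text && open && close);
    const char* start = std::strstr(text, open);
    if (!start) return "";
    start += std::strlen(open);              // skip past the opening marker
    const char* end = std::strstr(start, close);
    if (!end) return "";
    return std::string(start, end);          // copy the bytes in between
}
```

Called as ExtractBetween(html, "<title>", "</title>"), it makes one forward pass over the buffer with no pattern compilation and no backtracking, which is where the savings on long inputs come from. Note that it is case-sensitive, so it only fits rules that really are this simple.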

Drive Space issue!

Each keyphrase outputs, on average, a 200KB XML file with … stuff! After GZEncoding I have virtually the same performance but I cut drive space use to 30%.

Let's sum it up!
Testing 100 keyphrases × 100 pages = 10.000 HTML files to download and process. Each keyphrase outputs an average of 750-1250 viable sentences for further use. The old algo took about 2.5 - 3 hours of CPU and Internet abuse while the new algo takes 0.9 - 1.1 hours of CPU and Internet abuse with 30% of the old disk space use!

I can say I'm really happy with the results. And the code is stable … about 10 hours of run-time later :)

And the lesson of the day is:

  • Avoid WinINET and WinHTTP if you NEED performance, don't need HTTPS connections and can write your own HTTP client.
  • Always implement GZip decompression on HTTP transfers.
  • Keep RegExp to a minimum; any simple rule can be quickly written using basic string functions such as the str* family in C++.

I have an Extreme QX6700 Quad-Core so my times may differ a lot from others'. I have a friend who heard the CPU over-heating alarm sound for the first time because he ran 4 concurrent instances on a different Q6600 Quad-Core.

8 Comments Posted By Readers :

#1 m0nkeymafia from Great Britain
Posted on Tuesday, 05 February, 2008
Nice work, I presume you used blocking sockets?

When I wrote my HTTP class it was originally written with non-blocking sockets, as they had already been written. But I was only getting 10% of the throughput I had hoped for. After a rewrite of the socket class to use blocking sockets, and a bit of a logic rewrite, I saw a MASSIVE speed increase.
#2 5ubliminal web
Posted on Tuesday, 05 February, 2008
It's either blocking or multithreaded.
It depends. When you just need downloads you can use blocking sockets and virtual functions for progress updates. You then override them and use status updates where you need.
Non-blocking sockets are best for multithreaded servers. I rarely use them in client sockets. I never need them, and async operations are more difficult to watch over.

PS: When you use blocking don't forget to set timeouts with:
- setsockopt(hSocket, SOL_SOCKET, SO_SNDTIMEO, (CHAR*)&Timeout, sizeof(int));
- setsockopt(hSocket, SOL_SOCKET, SO_RCVTIMEO, (CHAR*)&Timeout, sizeof(int));
(Timeout=int in milliseconds) or it will freeze forever. ;)
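For anyone compiling outside Windows, a hedged sketch of the same idea as one helper. SetSocketTimeouts and SocketHandle are names invented for this example. The assumption to watch: Winsock takes the timeout as a DWORD in milliseconds, while POSIX systems expect a struct timeval.

```cpp
#ifdef _WIN32
  #include <winsock2.h>
  typedef SOCKET SocketHandle;
#else
  #include <sys/socket.h>
  #include <sys/time.h>
  typedef int SocketHandle;
#endif

// Apply the same send and receive timeout (in milliseconds) to a blocking
// socket. Returns true only if both options were set successfully.
bool SetSocketTimeouts(SocketHandle s, int timeoutMs) {
#ifdef _WIN32
    DWORD tv = static_cast<DWORD>(timeoutMs);       // Winsock: milliseconds
    const char* opt = reinterpret_cast<const char*>(&tv);
#else
    timeval tv;                                      // POSIX: seconds + microseconds
    tv.tv_sec = timeoutMs / 1000;
    tv.tv_usec = (timeoutMs % 1000) * 1000;
    const void* opt = &tv;
#endif
    return setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, opt, sizeof(tv)) == 0 &&
           setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, opt, sizeof(tv)) == 0;
}
```

Call it once, right after the socket is created and before connect / send / recv. On Windows, WSAStartup still has to run first.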
#3 m0nkeymafia from Great Britain
Posted on Tuesday, 05 February, 2008
Interesting, I don't actually set the timeout there. I choose the timeout when trying to accept an incoming socket, and when reading from it:

int retVal = select(FD_SETSIZE, &readSet, 0, &errorSet, &timeout);
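A minimal sketch of how that select() call can be wrapped for a single socket, assuming a hypothetical helper named WaitReadable (the surrounding state machine is not shown). It only watches the read set, where the line above also passes an error set.

```cpp
#ifdef _WIN32
  #include <winsock2.h>
  typedef SOCKET SocketHandle;
#else
  #include <sys/select.h>
  #include <sys/socket.h>
  #include <sys/time.h>
  typedef int SocketHandle;
#endif

// Wait until the socket has data to read or the timeout expires.
// Returns 1 if readable, 0 on timeout, -1 on error.
int WaitReadable(SocketHandle s, int timeoutMs) {
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(s, &readSet);

    timeval timeout;
    timeout.tv_sec = timeoutMs / 1000;
    timeout.tv_usec = (timeoutMs % 1000) * 1000;

    // Winsock ignores the first argument; POSIX wants highest descriptor + 1.
    const int ready = select(static_cast<int>(s) + 1, &readSet, nullptr, nullptr, &timeout);
    if (ready < 0) return -1;
    return ready > 0 ? 1 : 0;
}
```

The timeout is rebuilt on every call because some systems modify the timeval passed to select().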

I tend to use a single threaded server, that iterates through each socket [of which each socket is its own state machine]. Seems to work best for what I need it for.

Have you found that running X sockets, each with its own thread, is better than running 1 at a time at full power? I'm guessing that, given you're running a 4 core processor, it helps a lot? Unfortunately I can't design for parallelism as the code has to work on shite 1 core 1GHz Celeron processors lol.
#4 5ubliminal web
Posted on Tuesday, 05 February, 2008
For servers yes. But I haven't worked with servers for two years or so. Except a proxy tester … basic!
Single thread is good. Multithreaded is better but keep thread count to one per CPU.
IOCP is the shit! - check out CodeProject and read up on it.

I use Berkeley sockets only. No WSA stuff. socket, connect, setblocking, settimeouts, send / receive, close. I'll talk about sockets here a bit soon.
#5 PR from United States
Posted on Tuesday, 05 February, 2008
I'm just finishing up my implementation of a non-blocking HTTP client. For my uses non-blocking is a must. With multiple blocking threads using a 3rd party HTTP client, things become problematic above 500 threads just due to context switching and wasted memory. With non-blocking I can easily handle 2000 to 3000 connections with 4 threads (1 for opening connections and doing writes, 3 for doing reads). And, yes, GZip is a must.

I have found this to be ideal for a large selection of random web sites that may be slow both in opening the connection and in putting data into the pipe. With these types of sites you gain a lot more by not needing an entire thread to babysit every connection (only downloading one page at a time from any one site). It also works fantastically with inventory.overture.c*m where they may take 45 - 90 seconds to finish opening your connection, but will service hundreds of simultaneous connections (continuously for months). If you raise your connect timeout and separate connect requests by at least 100ms you can squeeze a lot of data out of them. (No wonder it has been so unreliable, they do nothing to keep the bh's from draining their resources dry.)

Anyway, my tasks don't take a ton of CPU so I'd just rather get my pages faster. And if I needed more CPU I'd just rent an extra large EC2 server from amazon for $0.80 per hour, and still keep getting my pages fast.

A general purpose HTTP client built to be all things to all people can never compete with a clean home-grown version that is designed for a special purpose.
#6 5ubliminal web
Posted on Tuesday, 05 February, 2008
Excellent comment. It's great to have readers that comment and say smart things :) I'll write a new implementation like this and I'll tell you how it works.
I have a funny feeling that you are right ;)
#7 blackhat seo from Greece web
Posted on Friday, 08 February, 2008
Why reinvent the wheel over and over? There's cURL.
#8 5ubliminal web
Posted on Saturday, 09 February, 2008
For me BOOST and cURL are out of the question. The only 3rd party libraries I use are crypto++, greta (RegExp), zlib and the image libraries (which you can't rewrite).
I write clean code, clutter-free, and like to keep it that way.

The moment I saw the source code of those big-a$$ libraries I knew they couldn't be used. I'm not really an STL guy.
I have all the lists, vectors, doubly linked lists, maps, string classes and so much more, written over many (7) years, in my own library which has just exceeded 200.000 lines of code with very little GUI algo.

Instead of using cURL to get an FTP file I'd rather use the clean WinINET or write an FTP client (10 minute job).

PS: I know I'm a bit 'nuts' but I got the code to back me up and keep me dependency free of the big libraries and I add at least 1000 lines daily to the library.
5ubliminal's TellinYa.com SEO & SEM Blog © 2007 - All rights reserved unless mentioned otherwise .
Rendered On : [Sunday, 06 July, 2008 - 04:16:16 GMT]   No Ajax / Flash Used Here