
Most of you will learn almost nothing from this post (so reading is optional :)) but you'll see the things I struggle with, on a daily basis ;) And later on I'll have a cool piece of software for you … not this one :)
I worked and worked and finally realized my code was not fast enough. It's a serious scraping and text-gathering application, built using WinHTTP and GRETA RegExp. All cool, but I realized that I could not get the download time for a 100-page pack below 45-60 seconds, with an average of 50 seconds, using 25 threads and a decent internet connection.
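To make the setup concrete, here is a minimal sketch of downloading a pack of pages with a fixed pool of 25 worker threads. It is an illustration in Python, not the original code: the names `download_pack` and `fake_fetch` are made up for this example, and `fake_fetch` only sleeps to imitate network latency instead of doing a real request.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def download_pack(urls, fetch, workers=25):
    """Fetch every URL with a fixed pool of worker threads.

    `fetch` is any callable that takes a URL and returns the page body.
    Returns (results, elapsed_seconds); results keep the order of `urls`.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though the work runs concurrently
        results = list(pool.map(fetch, urls))
    return results, time.perf_counter() - start


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request: sleep to mimic latency
    time.sleep(0.01)
    return "<html>" + url + "</html>"
```

Calling `download_pack(urls, fake_fetch)` with 100 URLs returns the 100 bodies plus the wall-clock time, which is the number the whole post is about bringing down.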
WinHTTP is excellent from one point of view. It has all the bells and whistles Microsoft adds to its libraries, and it has one big plus: it can handle HTTPS, which would be wasted time I can't afford right now to implement from scratch, even using the CryptoAPI.
After pondering for quite a while I realized I had to go through with a class I had half built. I had to write an HTTP client class from scratch (sockets and the HTTP protocol). I did some tests and noticed a speed increase of 10-30% if I ditched WinHTTP in favor of my own implementation.
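For readers who have never gone below a library like WinHTTP, this is roughly what "sockets and the HTTP protocol" boils down to for a plain GET. It is a minimal Python sketch, not the class described above: the function names are invented for the example, there is no HTTPS, and chunked transfer encoding and redirects are deliberately left out.

```python
import socket


def build_request(host, path="/", extra_headers=None):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    headers = {"Host": host, "Connection": "close"}
    headers.update(extra_headers or {})
    lines = ["GET %s HTTP/1.1" % path]
    lines += ["%s: %s" % (name, value) for name, value in headers.items()]
    # A blank line terminates the header section
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_response(raw):
    """Split a raw response into (status_code, headers, body).

    Header names are lower-cased. Chunked transfer encoding is NOT handled.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def http_get(host, path="/", port=80, timeout=10.0):
    """Send one GET over a plain TCP socket and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))
```

The request builder and the response parser are kept separate from the socket code so they can be checked without touching the network.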
I sat back and wondered: is an average increase of 15% in performance worth a complete rewrite of the HTTP client, a 2-3 hour job? I decided to do it, and I had an extra reason for this:
The magic words are Accept-Encoding: gzip, deflate. By adding this header to HTTP requests you usually receive (from compatible web servers) less than 50% of the data you get when requesting plain text. Most times it shrinks by over 75%, so download times also fall dramatically. But there is one small problem. WinHTTP does not swallow GZip or Deflate and, using either WinHTTP or my own HTTP client, I reached the same conclusion: the decompression had to be added by me!
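The decompression step itself is small once a zlib binding is available. The sketch below shows one way to do it in Python with the standard `zlib` module; `decode_body` is a name made up for this example and has nothing to do with the author's own functions. It assumes the whole response body is already in memory and that the server reported the encoding in its Content-Encoding header.

```python
import zlib

# Header to send with each request so the server may compress its reply
ACCEPT_ENCODING = ("Accept-Encoding", "gzip, deflate")


def decode_body(body, content_encoding):
    """Undo the Content-Encoding the server reported for `body` (bytes)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # Per the HTTP spec, "deflate" means zlib-wrapped data ...
            return zlib.decompress(body)
        except zlib.error:
            # ... but some servers send a raw deflate stream instead
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) encoding: hand the bytes back untouched
    return body
```

The fallback for raw deflate is there because servers have historically disagreed about what "deflate" means, so a client that only accepts the zlib-wrapped form will choke on some sites.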
Not to brag, but for me compressing and decompressing come down to two pairs of functions: Compress, Uncompress, GZEncode and GZDecode, which I had already written over a year ago. They just needed to blend with the HTTP client of choice. About 3 hours later I had a functional HTTP client which reduced download time from an average of 50 seconds to an average of 15 seconds. And this is where it ended.
I did not use cURL as I hate code clutter. I never use Boost, cURL or other such large and complex libraries. I write everything myself and keep things small :) A beast that could smoke the average CPU is smaller than 400KB in size.
Regular expressions are a gift from above, but there's one rule to abide by: when you have long data and easy rules … go for hand-written code. By re-implementing by hand some of the regular expressions I used in the code, I cut RegExp processing time in half.
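As an illustration of what "easy rules" can look like, here is a hand-written scanner for a simple "everything between two markers" rule, next to the regular expression that does the same job. This is a Python sketch with invented function names, not one of the rules from the actual software, and the speed gain reported above was measured against GRETA in native code, so it should not be assumed to carry over to Python's `re` module.

```python
import re


def extract_between(text, start, end):
    """Return every substring found between `start` and `end` markers."""
    found = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break
        found.append(text[begin:finish])
        # Resume scanning right after the closing marker
        pos = finish + len(end)
    return found


def extract_between_regex(text, start, end):
    """Regex version of the same rule, kept for comparison."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)
```

For example, `extract_between(html, 'href="', '"')` collects link targets with nothing but two substring searches per match, with no backtracking and no pattern compilation.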
Each keyphrase outputs, on average, a 200KB XML file with … stuff! After GZEncoding the output I have virtually the same performance, but I cut the disk space cost to 30%.
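Storing the output compressed takes only a couple of lines when a gzip library is at hand. A minimal Python sketch, assuming the XML is available as one string; `save_xml_gz` and `load_xml_gz` are hypothetical names for this example, not the GZEncode and GZDecode functions mentioned earlier.

```python
import gzip
import os


def save_xml_gz(path, xml_text):
    """Write XML text to `path` as a gzip file; return the bytes on disk."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(xml_text)
    return os.path.getsize(path)


def load_xml_gz(path):
    """Read back XML text written by save_xml_gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

How much space this saves depends on the data; repetitive markup such as XML tends to compress well, which fits the figure quoted above.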
Testing: 100 keyphrases x 100 pages = 10,000 HTML files to download and process. Each keyphrase outputs an average of 750-1250 viable sentences for further use. The old algo took about 2.5 - 3 hours of CPU and Internet abuse, while the new algo takes 0.9 - 1.1 hours of CPU and Internet abuse, with 30% of the old disk space use!
I can say I'm really happy with the results. And the code is stable … about 10 hours of run-time later :)
I have an Extreme QX6700 Quad-Core so my times may differ considerably from others'. I have a friend who heard the CPU overheating alarm sound for the first time because he ran 4 concurrent instances on a different quad-core, a Q6600.