
5ubliminal's KeyDetox - SEO Keyphrase List Cleaner Tool : 5ubliminal's TellinYa

Bug Fix: Fixed a small bug that did not remove shuffles properly. Now they're gone.
If you have problems starting it make sure you have VS2005SP1 Redist installed!

Been really busy these past few days. Had to rewrite two pieces of code I've been postponing for quite a while: FTP Client and ZIP Compression. When I ditched WinINET in favor of WinHTTP I lost its integrated FTP Client, so I had to rewrite it from scratch. I also wrote the ZIP archive creation code based on ZLib compression. I got WinZIP/WinRAR (I mean they work:)) compatible archive creation in fewer lines of code than anyone else:) So, as I feel great, I got a treat for you … a tool I finished yesterday.

XCuse any spelling errors:) Wrote this in a hurry as I got a lot of coding left to do today but wanted this out now.

New SEO Keywords' Software Tool!

Since Dan pointed out the MS Ad Intelligence Excel plugin, I've been so much into it that … well, I can say in terms of software Microsoft is the sh|t. The tool is so easy to use and they are so honest regarding their traffic. They show you all that you need there.

But I'm not here to talk about the MS tool, not right now, but I'm here to talk about my tool. Along with the excellent keywords suggested by Ad Intelligence there came a problem. All my other keywords' tools have internal cleaning filters applied as data was received from the source. The problem is I had no tool to clean an external list of keywords. And my team needed something like this too.

So I took a few hours (around 7) to create this tool. Add 3 hours of fine-tuning … = a 10 hour job, and I'm rather pleased with the results. I'll have a few additions for it in the next days but for now it's all I needed. From here on I'll explain in detail how it works, how it looks and what the end goal is, so stay sharp.

What's this tool all about?

There's a certain characteristic you need to learn about my tools:

Remember FUBU: For Us By Us? My software is FMBM: my tools are made For Me By Me.

This means I build stuff for personal use and sometimes I share. But it's not very easy to get a grip on them quickly. But they look so easy to me:)

The keywords suggested are sometimes dirty. Dirty as in: containing unrelated words or site names you don't need to use. They can also contain unnecessary punctuation or duplicate words. So these lists need to be cleaned and this tool will do just that. It has several rules you can apply plus a set of RegExp rules to make your keyword list shiny new. See the steps you need to follow to use the Key-Detox properly below.

Importing Keywords

To import keywords you need to drag and drop .txt files on the interface or use the menu to import from Clipboard. Duplicates are removed on Import and, as you will notice, the window is built to stay on top so you can drag and drop on it easily. When the window loses focus it becomes translucent so you can better see what you drag and drop. You can also select and manually delete keywords from the list with Right Click.
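A minimal sketch of duplicate removal on import, in Python rather than the tool's own code. The function name `import_keyphrases` is hypothetical, and the sketch assumes that duplicates are exact matches after trimming whitespace; whether KeyDetox also ignores case is not stated.

```python
def import_keyphrases(lines, existing=None):
    """Merge new lines into the list, skipping blanks and duplicates.

    Assumption: two keyphrases are duplicates when they are identical
    after stripping leading/trailing whitespace.
    """
    result = list(existing or [])
    seen = set(result)
    for line in lines:
        phrase = line.strip()
        if phrase and phrase not in seen:
            seen.add(phrase)
            result.append(phrase)  # keep first-seen order
    return result


if __name__ == "__main__":
    first = import_keyphrases(["cheap car insurance", "car loans", "car loans "])
    print(import_keyphrases(["car loans", "used cars"], existing=first))
```

Keeping a `set` next to the list makes each duplicate check cheap while the list preserves the order in which keyphrases arrived.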

Understanding Indexes

Indexes are words that can be found in keyphrases. The keyphrases you add are split into words and their occurrences are counted. When you check out the Indexes section you will see, grouped by first letter, the words that can be found in your keyphrases and how many times they appear. Data is updated in real time, so every time you alter the main keyphrase list the indexes are recalculated.

You will look through them and find the ones inappropriate for your keyword list. You will right click them (multiple selection works) and press Add To Stop Words.
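The index idea can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: `build_indexes` is a hypothetical name, and splitting on whitespace is an assumption about how keyphrases are broken into words.

```python
from collections import Counter, defaultdict


def build_indexes(keyphrases):
    """Count every word across all keyphrases, grouped by first letter."""
    counts = Counter(word for phrase in keyphrases for word in phrase.split())
    by_letter = defaultdict(dict)
    for word, n in counts.items():
        by_letter[word[0]][word] = n
    return dict(by_letter)


if __name__ == "__main__":
    phrases = ["cheap car insurance", "car loans", "buyonlinecom car"]
    for letter, words in sorted(build_indexes(phrases).items()):
        print(letter, words)
```

A word such as `buyonlinecom` shows up in the index with its count, which is what makes it easy to spot and send to the stop word list.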

Understanding Stop Words

Stop Words are a list of words, one per line. Any keyphrase containing one of them is considered invalid and gets deleted. So look in the Indexes and add words like webpage names, phone numbers, funny words and letters and so on.
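As a rough Python sketch of that filter (hypothetical function name, and assuming a stop word has to match a whole word of the keyphrase rather than a substring):

```python
def apply_stop_words(keyphrases, stop_words):
    """Drop every keyphrase that contains at least one stop word.

    Assumption: matching is done on whole, whitespace-separated words.
    """
    stops = set(stop_words)
    return [p for p in keyphrases if not stops.intersection(p.split())]


if __name__ == "__main__":
    phrases = ["cheap car insurance", "buyonlinecom car", "car loans"]
    print(apply_stop_words(phrases, ["buyonlinecom"]))
```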

Right-Side Exclusion Rules

The first set of exclusion rules is seen on the right. There are some buttons there. I won't explain what they do here, but do keep in mind those are checkboxes. Pressed = ON! Each of them has a Tooltip with a detailed explanation of its functionality, so if you're interested you will download the software and use it.

Configuring the RegExp Exclusion Rules

This is by far the coolest and most advanced section of this software. With some RegExp knowledge you can replace here all the other rules you can set in the interface. If you look at the rules you will see each starts with one of: ?-+. You must start a RegExp with one of those signs for it to take effect. If your RegExp skills are rusty the software will kindly notify you:)

The + Preceded Rules:
These rules are mandatory: every keyphrase left after processing has to abide by them. So the final list will validate against all the RegExps you set with a leading +. I have two set by default. The 1st accepts only words containing alphanumerics and spaces, and the 2nd forces valid words to have between 10 and 25 letters.

The - Preceded Rules:
These rules are bans: no keyphrase left after processing will match any of them. So the final list will not validate against any of the RegExps you set with a leading -. I have two set by default. The 1st keeps only keyphrases whose words are longer than 2 letters, and the 2nd keeps only keyphrases whose words are shorter than 15 letters.
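A minimal sketch of how + and - rules could be evaluated, in Python rather than the tool's own code. The four patterns are guesses at rules equivalent to the defaults described here, not the actual defaults shipped with KeyDetox, and Python's `re` syntax may differ from the RegExp flavor the tool uses. The name `passes` is hypothetical.

```python
import re

# Hypothetical rules modeled on the described defaults (assumptions):
RULES = [
    r"+^[a-z0-9 ]+$",   # mandatory: only alphanumerics and spaces
    r"+^.{10,25}$",     # mandatory: 10 to 25 characters in total
    r"-\b\w{1,2}\b",    # ban: any word of 1 or 2 letters
    r"-\b\w{15,}\b",    # ban: any word of 15 or more letters
]


def passes(phrase, rules):
    """True if the phrase matches every + rule and no - rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        matched = re.search(pattern, phrase) is not None
        if sign == "+" and not matched:
            return False
        if sign == "-" and matched:
            return False
    return True


if __name__ == "__main__":
    for phrase in ["cheap car insurance", "buy it now online", "car"]:
        print(phrase, "->", passes(phrase, RULES))
```

Rules that start with any other sign (such as ?) are simply ignored by this sketch.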

The ? Preceded Rules:
These rules are optional: every keyphrase left after processing will abide by at least one of them. So if you set 5 rules with ?, each remaining keyphrase will validate against at least one. There's a catch: you can split optionals into packs. Let's say you want only words that start with C or end in A and have 11-12 or 19-20 characters. This set of rules cannot be achieved using - or + rules, and not even by optionals on their own. So I created option packs like this: ?[#PackID]RULE. PackID is a number used to group rules; if it's not found it defaults to 0. So:

?^c
?a$
?[1]^(.{11,12})$
?[1]^(.{19,20})$
means:
?[0]^c
?[0]a$
?[1]^(.{11,12})$
?[1]^(.{19,20})$

If you have several optional packs, then at least one rule of each pack has to validate, which lets you set combined rules unachievable with exclusion (-) or inclusion (+) rules. I'm sure I've just squashed your brains but … I can't explain this better as it looks so easy :) It's sick!
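To make the pack logic concrete, here is a small Python sketch: rules are grouped by pack id (0 when none is given), and a keyphrase passes when at least one rule in every pack matches. This is an illustration under assumptions, not the tool's code. The names `parse_optionals` and `passes_optionals` are hypothetical, and the sketch treats a leading `[digits]` as a pack id, never as a character class.

```python
import re
from collections import defaultdict

# "?" followed by an optional "[PackID]" and then the RegExp itself.
PACK_RE = re.compile(r"^\?(?:\[(\d+)\])?(.*)$")


def parse_optionals(rules):
    """Group the ?-rules by pack id; rules with other signs are skipped."""
    packs = defaultdict(list)
    for rule in rules:
        m = PACK_RE.match(rule)
        if m:
            pack_id = int(m.group(1)) if m.group(1) else 0
            packs[pack_id].append(re.compile(m.group(2)))
    return packs


def passes_optionals(phrase, packs):
    """True if, in every pack, at least one rule matches the phrase."""
    return all(any(rx.search(phrase) for rx in pack) for pack in packs.values())


if __name__ == "__main__":
    packs = parse_optionals(["?^c", "?a$", "?[1]^(.{11,12})$", "?[1]^(.{19,20})$"])
    for phrase in ["cheap car insurance", "fresh pasta", "dog training", "car"]:
        print(phrase, "->", passes_optionals(phrase, packs))
```

In other words, rules inside a pack are OR-ed together and the packs themselves are AND-ed.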

With the right combination of RegExps you can achieve any goal!

The Menu:

It's all obvious except the first SubMenu. Reset List restores the list to the state it had before processing. All imported keyphrases are stored internally, so if you make a mistake and exclude too many with your rules, you can use Reset and start over without dragging and dropping the keyphrase files again. Clear List clears both the list you see and the internal one. After that you can't Reset anymore and you have to import again.
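The difference between Reset and Clear boils down to keeping two lists. A tiny Python sketch of that idea, with a hypothetical class and method names (the tool's real internals are not shown in this post):

```python
class KeyphraseList:
    """Visible list plus an internal copy of everything imported."""

    def __init__(self):
        self._imported = []  # internal copy, untouched by processing
        self.visible = []    # what the list shows

    def import_phrases(self, phrases):
        for p in phrases:
            if p not in self._imported:  # duplicates are dropped on import
                self._imported.append(p)
                self.visible.append(p)

    def process(self, keep):
        """Apply a filter; only the visible list shrinks."""
        self.visible = [p for p in self.visible if keep(p)]

    def reset(self):
        """Restore the visible list from the internal copy."""
        self.visible = list(self._imported)

    def clear(self):
        """Drop both lists; nothing is left to reset to."""
        self._imported.clear()
        self.visible.clear()
```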

Downloading the damn thing:

You can use this tool for the simplest goal of removing duplicates from sets of keyphrases, or to apply advanced rules to sets of phrases. By using RegExp you can keep only phrases that start with a word, or end with a word, or contain one of these but none of those, and so on. Imagination is the limit if I may say. This tool is ASCII (a-z and the likes) character friendly. No guarantee for anything else. Serious UNICODE and RegExps … tough.

And if you got questions / spot bugs / got ideas - shoot. I'm listening and ready to help:)

You've finally got to the download section. Click here to download the tool. Do keep in mind I take tips.

10 Comments Posted By Readers :

#1 Papa Rage from United States
Posted on Thursday, 03 April, 2008
I'm impressed. I'm importing keywords into it now. This is like something I'd write :)

but maybe I shouldn't have tried to import 1,900,000 keywords all at once. Been about 5 minutes.

going to get some more coffee.
#2 5ubliminal web
Posted on Thursday, 03 April, 2008
@PR: Oh GOD. This is meant to get themed keywords out of it. Not to clean huge lists.
I didn't even add a Cancel button as 50.000 keywords take a few seconds, but I didn't imagine someone would add millions.

Damn … you'll need some RedBull too.
#3 Papa Rage from United States
Posted on Thursday, 03 April, 2008
It's ok, only one of my cores is pegged and the 1GB of ram it's playing with leaves me plenty of room to get my other work done.

Frankly, the only keyword list I have atm that needs to be cleaned up is that big one. There are tons of '.com's and foreign language keywords and dups and all the standard cruft.

Last year I ran this entire list through overture suggestion tool and saved off the suggestions as a poor man's way of cleaning the list. Unfortunately they only suggest keywords that have more than 20 searches in the last month. Of course they don't have the best data, so a ton of good keywords never got through.

the ad Intelligence tool is much much better than the overture tool so i need to clean up this list.

I had no qualms about 'leveraging' overture, they were happy to service my 500-1000 requests per hour from the same ip. but since you need to log into a MS server for adintelligence I need to be more careful with them.

45 minutes, still going. I have these keywords also in 2000+ small files. I could drag them in 10 at a time or so, but I hate repetitive work. That's why I write software! I'll just wait it out :)
#4 5ubliminal web
Posted on Thursday, 03 April, 2008
I'm afraid it'll crash eventually. I could help you get them into a MDB (Access Database) and clean them on the way if you want.
This tool handles all in memory and I have no idea how it would work with millions of keywords.
I got mine nicely setup in Access databases which I SQL query to extract.

So tell me how you got the keywords setup on your drive (all in one folder, are files named especially somehow) and I'll send you a tool to fix your problem:)
#5 Papa Rage from United States
Posted on Thursday, 03 April, 2008
Yeah it crashed. I've got them in a MySQL database organized by category, also in 2000+ category txt files, also one big txt file, also in couple sheets in an excel file (sucks that you can't have more than 1 million rows in an excel sheet)

Now I'm using your tool with about 50,000 keywords to figure out what rules and exclusions I want to use. Once I have that figured out I may have to write something myself to handle the bulk of the data. Unless you're going to release a command line process mode that can handle large files.

The transparency and always on top is most helpful, and the way you have this organized is very insightful. The double-byte character encoded output of your export function isn't handled well by my text editor, tho it shows up correctly in notepad and wordpad.
#6 5ubliminal web
Posted on Thursday, 03 April, 2008
I have everything in UNICODE so text files are saved in UTF16 also. That's a good idea. As I weed out the UNICODE chars I could save it as ASCII :)
I just updated with an ASCII saving version.
You should wait until tomorrow. Fine tune your RegExps today and I'll have an extra feature to handle keyphrases as text files and apply rules to them by tomorrow.

Thinking twice, bulk processing won't help you much. RegExp rules will not remove parasite keywords like misspellings and such words. The best way is to get batches and process them.
If you process text files you will lose one major feature: INDEXES. In Indexes you get to find words that will pass regexp filters but should not be there.
And index filters are very important to make sure your list is clean and relevant.

A word like buyonlinecom could pass regexp filters but could mess up your content and provide relevant support (links) to the site that injected the fake query in OVERTURE. And there are loads of those.
They can only be found in indexes. So … if you got a solution for this except a bit of manual labour … tell me.
#7 all0nym from United States
Posted on Thursday, 03 April, 2008
Awesome tool. Thanks for sharing it.
#8 5ubliminal web
Posted on Friday, 04 April, 2008
Glad you like it;)
#9 nogenius from United States
Posted on Friday, 04 April, 2008
Very nice app... really inspires me to continue learning a language capable of creating desktop applications (have done a bit of Java in the past).

On that side note, I know you are a pro/veteran at C++, but if you could start programming all over again, would you still choose C++, or would you go with another language, and why?
#10 5ubliminal web
Posted on Saturday, 05 April, 2008
@nogenius: Thanks.
I could guess you ask this question to get some advice. I always like to take the hard way. C++ is difficult especially with no MFC or ATL like me. But it is soooo fast.
But it takes a lot to learn and put your libraries together. If I were to start now I would also choose C++. Need for speed.

But if you were to start … think about C#. Much shorter learning curve. A lot of stuff already done and available for you.
You would take a shortcut, with a bit slower software, but until you begin coding multithreaded apps like me … it'll take a while.

Regards.
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .
Rendered On : [Friday, 21 November, 2008 - 09:55:41 GMT]   No Ajax / Flash Used Here