Extract Domain Name From Host Name In PHP : 5ubliminal's TellinYa

First you need to understand why!
Someone said this is already in PHP, but he's slightly wrong. parse_url will, indeed, bring you the hostname! But between a hostname and a domain name there's a huge difference.

For http://www.asite.co.uk/ parse_url will return www.asite.co.uk as the hostname, but the domain name is asite.co.uk. The list you'll find here is quite up to date, including CentralNic's uk.com and similar TLDs and 2nd level TLDs. Read what these are below!
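
To see the difference for yourself, here is a minimal sketch using only the stock parse_url function. The URL is the example from the paragraph above; the variable names are mine and just for illustration.

```php
<?php
// Example URL taken from the paragraph above.
$url = 'http://www.asite.co.uk/';

// With PHP_URL_HOST, parse_url() returns only the host component of the URL.
$hostName = parse_url($url, PHP_URL_HOST);

// This is the hostname, not the domain name (which would be asite.co.uk).
echo $hostName, "\n"; // www.asite.co.uk
```

So parse_url gets you halfway there; trimming the hostname down to the domain is the part that needs the TLD list.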

I had to consider PHP coders also!

Over a month ago I published the Domain From URL in C++. The same theory applies here, but this code is in PHP. I think some of you may find good use for it! The code will extract the domain name from a host name or URL, and the result will contain the domain, the TLD and maybe a country level TLD (2nd level TLD).
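
As a rough sketch of the idea (not the full function, and with variable names that are only illustrative), a hostname can be split on dots and peeled apart from the right. Whether the second piece really is a 2nd level TLD is simply assumed here; the real code has to look it up in a list.

```php
<?php
// Split a hostname into its dot-separated pieces and peel them off from the right.
$labels = explode('.', 'www.asite.co.uk');

$tld    = array_pop($labels); // 'uk'    : the top level domain
$second = array_pop($labels); // 'co'    : possibly a 2nd level TLD
$third  = array_pop($labels); // 'asite' : only needed when 'co.uk' is a 2nd level TLD

// Assumption for this sketch: 'co' is a known 2nd level TLD under 'uk',
// so the domain name is made of three pieces instead of two.
$domain = "$third.$second.$tld";
echo $domain, "\n"; // asite.co.uk
```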

As some will notice, I took a different approach in the PHP version compared to the C++ one. I don't use one big chunk of text; I split it by top level domain (TLD). This is done to make it as fast as it should be! The string matching is done on very small slices. I enclosed each list between , (comma) and use , (comma) as a separator. This way I won't miss the first entry or the last entry, I won't find partial matches, and a single strstr call does all the work! Any other way would have needed more than just one function.
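
Here is a small sketch of why the surrounding commas matter. The two strings are made-up samples in the same shape as the lists in the code; only the standard strstr function is used.

```php
<?php
$delimited   = ',com,net,org,'; // entries wrapped in commas, the way the lists are stored
$undelimited = 'com,net,org';   // the same entries without the outer commas

// Searching for a bare "co" wrongly hits the start of "com".
$partial = strstr($undelimited, 'co') !== false; // true (a false positive)

// Searching for ",co," can only match a whole entry, so nothing is found.
$whole = strstr($delimited, ',co,') !== false;   // false (correct)

// Thanks to the outer commas the first and last entries are still found.
$first = strstr($delimited, ',com,') !== false;  // true
$last  = strstr($delimited, ',org,') !== false;  // true

var_dump($partial, $whole, $first, $last);
```

One strstr call on a short string answers the question, with no RegExp and no loop.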

Converting URL (Host) To Domain Name PHP Function

<?php
/* I decided to instantiate this array outside the function. I'm not sure how PHP works
but this has to be faster than initializing the array every time it is used! */
$countryLevelTLDs=array(); // The MEGA array of country level TLDs ... edit as you need!
/*
I put this together in a peculiar way. You will see it starts, ends and separates with , (comma)
Why did I do this? I wanted to reduce execution fat as much as possible.
If you edit the array always make sure the entries are separated by , and each list starts and ends with ,
This is to ensure the function won't find partial matches or miss the first or last term!
I also wanted to reduce all this to a quick string match and no RegExp or complex code!
-- It needs to be fast :)
*/
$countryLevelTLDs['ac']=',com,edu,gov,net,mil,org,';
$countryLevelTLDs['ae']=',com,net,org,gov,ac,co,sch,pro,';
$countryLevelTLDs['ai']=',com,org,edu,gov,';
$countryLevelTLDs['ar']=',com,net,org,gov,mil,edu,int,';
$countryLevelTLDs['at']=',co,ac,or,gv,priv,';
$countryLevelTLDs['au']=',com,gov,org,edu,id,oz,info,net,asn,csiro,telemem';
$countryLevelTLDs['au'].='o,conf,otc,id,';
$countryLevelTLDs['az']=',com,net,org,';
$countryLevelTLDs['bb']=',com,net,org,';
$countryLevelTLDs['be']=',ac,belgie,dns,fgov,';
$countryLevelTLDs['bh']=',com,gov,net,edu,org,';
$countryLevelTLDs['bm']=',com,edu,gov,org,net,';
$countryLevelTLDs['br']=',adm,adv,agr,am,arq,art,ato,bio,bmd,cim,cng,cnt,c';
$countryLevelTLDs['br'].='om,coop,ecn,edu,eng,esp,etc,eti,far,fm,fnd,fot,fs';
$countryLevelTLDs['br'].='t,ggf,gov,imb,ind,inf,jor,lel,mat,med,mil,mus,net';
$countryLevelTLDs['br'].=',nom,not,ntr,odo,org,ppg,pro,psc,psi,qsl,rec,slg,';
$countryLevelTLDs['br'].='srv,tmp,trd,tur,tv,vet,zlg,';
$countryLevelTLDs['bs']=',com,net,org,';
$countryLevelTLDs['ca']=',ab,bc,mb,nb,nf,nl,ns,nt,nu,on,pe,qc,sk,yk,gc,';
$countryLevelTLDs['ck']=',co,net,org,edu,gov,';
$countryLevelTLDs['cn']=',com,edu,gov,net,org,ac,ah,bj,cq,gd,gs,gx,gz,hb,h';
$countryLevelTLDs['cn'].='e,hi,hk,hl,hn,jl,js,ln,mo,nm,nx,qh,sc,sn,sh,sx,tj';
$countryLevelTLDs['cn'].=',tw,xj,xz,yn,zj,';
$countryLevelTLDs['co']=',arts,com,edu,firm,gov,info,int,nom,mil,org,rec,s';
$countryLevelTLDs['co'].='tore,web,';
$countryLevelTLDs['cr']=',ac,co,ed,fi,go,or,sa,';
$countryLevelTLDs['cu']=',com,net,org,';
$countryLevelTLDs['cy']=',ac,com,gov,net,org,';
$countryLevelTLDs['dk']=',co,';
$countryLevelTLDs['do']=',art,com,edu,gov,gob,org,mil,net,sld,web,';
$countryLevelTLDs['dz']=',com,org,net,gov,edu,ass,pol,art,';
$countryLevelTLDs['ec']=',com,edu,fin,med,gov,mil,org,net,';
$countryLevelTLDs['ee']=',com,pri,fie,org,med,';
$countryLevelTLDs['eg']=',com,edu,eun,gov,net,org,sci,';
$countryLevelTLDs['er']=',com,net,org,edu,mil,gov,ind,';
$countryLevelTLDs['es']=',com,org,gob,edu,nom,';
$countryLevelTLDs['et']=',com,gov,org,edu,net,biz,name,info,';
$countryLevelTLDs['fj']=',ac,com,gov,id,org,school,';
$countryLevelTLDs['fk']=',com,ac,gov,net,nom,org,';
$countryLevelTLDs['fr']=',asso,nom,barreau,com,prd,presse,tm,aeroport,asse';
$countryLevelTLDs['fr'].='dic,avocat,avoues,cci,chambagri,gouv,greta,medeci';
$countryLevelTLDs['fr'].='n,notaires,pharmacien,port,veterinaire,';
$countryLevelTLDs['ge']=',com,edu,gov,mil,net,org,pvt,';
$countryLevelTLDs['gg']=',co,org,sch,ac,gov,ltd,ind,net,alderney,guernsey,';
$countryLevelTLDs['gg'].='sark,';
$countryLevelTLDs['gr']=',com,edu,gov,net,org,';
$countryLevelTLDs['gt']=',com,edu,net,gob,org,mil,ind,';
$countryLevelTLDs['gu']=',com,edu,net,org,gov,mil,';
$countryLevelTLDs['hk']=',com,net,org,idv,gov,edu,';
$countryLevelTLDs['hu']=',co,erotika,jogasz,sex,video,info,agrar,film,kony';
$countryLevelTLDs['hu'].='velo,shop,org,bolt,forum,lakas,suli,priv,casino,g';
$countryLevelTLDs['hu'].='ames,media,szex,sport,city,hotel,news,tozsde,tm,e';
$countryLevelTLDs['hu'].='rotica,ingatlan,reklam,utazas,';
$countryLevelTLDs['id']=',ac,co,go,mil,net,or,';
$countryLevelTLDs['il']=',co,net,org,ac,gov,muni,idf,';
$countryLevelTLDs['im']=',co,net,org,ac,gov,nic,';
$countryLevelTLDs['in']=',co,net,ac,ernet,gov,nic,res,gen,firm,mil,org,ind,';
$countryLevelTLDs['ir']=',ac,co,gov,id,net,org,sch,';
$countryLevelTLDs['je']=',ac,co,net,org,gov,ind,jersey,ltd,sch,';
$countryLevelTLDs['jo']=',com,org,net,gov,edu,mil,';
$countryLevelTLDs['jp']=',ad,ac,co,go,or,ne,gr,ed,lg,net,org,gov,hokkaido,';
$countryLevelTLDs['jp'].='aomori,iwate,miyagi,akita,yamagata,fukushima,ibar';
$countryLevelTLDs['jp'].='aki,tochigi,gunma,saitama,chiba,tokyo,kanagawa,ni';
$countryLevelTLDs['jp'].='igata,toyama,ishikawa,fukui,yamanashi,nagano,gifu';
$countryLevelTLDs['jp'].=',shizuoka,aichi,mie,shiga,kyoto,osaka,hyogo,nara,';
$countryLevelTLDs['jp'].='wakayama,tottori,shimane,okayama,hiroshima,yamagu';
$countryLevelTLDs['jp'].='chi,tokushima,kagawa,ehime,kochi,fukuoka,saga,nag';
$countryLevelTLDs['jp'].='asaki,kumamoto,oita,miyazaki,kagoshima,okinawa,sa';
$countryLevelTLDs['jp'].='pporo,sendai,yokohama,kawasaki,nagoya,kobe,kitaky';
$countryLevelTLDs['jp'].='ushu,utsunomiya,kanazawa,takamatsu,matsuyama,';
$countryLevelTLDs['kh']=',com,net,org,per,edu,gov,mil,';
$countryLevelTLDs['kr']=',ac,co,go,ne,or,pe,re,seoul,kyonggi,';
$countryLevelTLDs['kw']=',com,net,org,edu,gov,';
$countryLevelTLDs['la']=',com,net,org,';
$countryLevelTLDs['lb']=',com,org,net,edu,gov,mil,';
$countryLevelTLDs['lc']=',com,edu,gov,net,org,';
$countryLevelTLDs['lv']=',com,net,org,edu,gov,mil,id,asn,conf,';
$countryLevelTLDs['ly']=',com,net,org,';
$countryLevelTLDs['ma']=',co,net,org,press,ac,';
$countryLevelTLDs['mk']=',com,';
$countryLevelTLDs['mm']=',com,net,org,edu,gov,';
$countryLevelTLDs['mn']=',com,org,edu,gov,museum,';
$countryLevelTLDs['mo']=',com,net,org,edu,gov,';
$countryLevelTLDs['mt']=',com,net,org,edu,tm,uu,';
$countryLevelTLDs['mx']=',com,net,org,gob,edu,';
$countryLevelTLDs['my']=',com,org,gov,edu,net,';
$countryLevelTLDs['na']=',com,org,net,alt,edu,cul,unam,telecom,';
$countryLevelTLDs['nc']=',com,net,org,';
$countryLevelTLDs['ng']=',ac,edu,sch,com,gov,org,net,';
$countryLevelTLDs['ni']=',gob,com,net,edu,nom,org,';
$countryLevelTLDs['np']=',com,net,org,gov,edu,';
$countryLevelTLDs['nz']=',ac,co,cri,gen,geek,govt,iwi,maori,mil,net,org,sc';
$countryLevelTLDs['nz'].='hool,';
$countryLevelTLDs['om']=',com,co,edu,ac,gov,net,org,mod,museum,biz,pro,med,';
$countryLevelTLDs['pa']=',com,net,org,edu,ac,gob,sld,';
$countryLevelTLDs['pe']=',edu,gob,nom,mil,org,com,net,';
$countryLevelTLDs['pg']=',com,net,ac,';
$countryLevelTLDs['ph']=',com,net,org,mil,ngo,';
$countryLevelTLDs['pl']=',aid,agro,atm,auto,biz,com,edu,gmina,gsm,info,mai';
$countryLevelTLDs['pl'].='l,miasta,media,mil,net,nieruchomosci,nom,org,pc,p';
$countryLevelTLDs['pl'].='owiat,priv,realestate,rel,sex,shop,sklep,sos,szko';
$countryLevelTLDs['pl'].='la,targi,tm,tourism,travel,turystyka,';
$countryLevelTLDs['pk']=',com,net,edu,org,fam,biz,web,gov,gob,gok,gon,gop,';
$countryLevelTLDs['pk'].='gos,';
$countryLevelTLDs['ps']=',edu,gov,plo,sec,';
$countryLevelTLDs['pt']=',com,edu,gov,int,net,nome,org,publ,';
$countryLevelTLDs['py']=',com,net,org,edu,';
$countryLevelTLDs['qa']=',com,net,org,edu,gov,';
$countryLevelTLDs['re']=',asso,com,nom,';
$countryLevelTLDs['ro']=',com,org,tm,nt,nom,info,rec,arts,firm,store,www,';
$countryLevelTLDs['ru']=',com,net,org,gov,pp,';
$countryLevelTLDs['sa']=',com,edu,sch,med,gov,net,org,pub,';
$countryLevelTLDs['sb']=',com,net,org,edu,gov,';
$countryLevelTLDs['sd']=',com,net,org,edu,sch,med,gov,';
$countryLevelTLDs['se']=',tm,press,parti,brand,fh,fhsk,fhv,komforb,kommuna';
$countryLevelTLDs['se'].='lforbund,komvux,lanarb,lanbib,naturbruksgymn,sshn';
$countryLevelTLDs['se'].=',org,pp,';
$countryLevelTLDs['sg']=',com,net,org,edu,gov,per,';
$countryLevelTLDs['sh']=',com,net,org,edu,gov,mil,';
$countryLevelTLDs['st']=',gov,saotome,principe,consulado,embaixada,org,edu';
$countryLevelTLDs['st'].=',net,com,store,mil,co,';
$countryLevelTLDs['sv']=',com,org,edu,gob,red,';
$countryLevelTLDs['sy']=',com,net,org,gov,';
$countryLevelTLDs['th']=',ac,co,go,net,or,';
$countryLevelTLDs['tn']=',com,net,org,edunet,gov,ens,fin,nat,ind,info,intl';
$countryLevelTLDs['tn'].=',rnrt,rnu,rns,tourism,';
$countryLevelTLDs['tr']=',com,net,org,edu,gov,mil,bbs,gen,';
$countryLevelTLDs['tt']=',co,com,org,net,biz,info,pro,int,coop,jobs,mobi,t';
$countryLevelTLDs['tt'].='ravel,museum,aero,name,gov,edu,nic,us,uk,ca,eu,es';
$countryLevelTLDs['tt'].=',fr,it,se,dk,be,de,at,au,';
$countryLevelTLDs['tv']=',co,';
$countryLevelTLDs['tw']=',com,net,org,edu,idv,gov,';
$countryLevelTLDs['ua']=',com,net,org,edu,gov,';
$countryLevelTLDs['ug']=',ac,co,or,go,';
$countryLevelTLDs['uk']=',co,me,org,edu,ltd,plc,net,sch,nic,ac,gov,nhs,pol';
$countryLevelTLDs['uk'].='ice,mod,';
$countryLevelTLDs['us']=',dni,fed,';
$countryLevelTLDs['uy']=',com,edu,net,org,gub,mil,';
$countryLevelTLDs['ve']=',com,net,org,co,edu,gov,mil,arts,bib,firm,info,in';
$countryLevelTLDs['ve'].='t,nom,rec,store,tec,web,';
$countryLevelTLDs['vi']=',co,net,org,';
$countryLevelTLDs['vn']=',com,biz,edu,gov,net,org,int,ac,pro,info,health,n';
$countryLevelTLDs['vn'].='ame,';
$countryLevelTLDs['vu']=',com,edu,net,org,de,ch,fr,';
$countryLevelTLDs['ws']=',com,net,org,gov,edu,';
$countryLevelTLDs['yu']=',ac,co,edu,org,';
$countryLevelTLDs['ye']=',com,net,org,gov,edu,mil,';
$countryLevelTLDs['za']=',ac,alt,bourse,city,co,edu,gov,law,mil,net,ngo,no';
$countryLevelTLDs['za'].='m,org,school,tm,web,';
$countryLevelTLDs['zw']=',co,ac,org,gov,';
$countryLevelTLDs['org']=',eu,dk,';
$countryLevelTLDs['com']=',au,br,cn,de,eu,gb,hu,no,qc,ru,sa,se,uk,us,uy,za,';
$countryLevelTLDs['net']=',de,gb,uk,';
$countryLevelTLDs['no']=',tel,';
$countryLevelTLDs['nr']=',fax,mob,mobil,mobile,tel,tlf,';
/* The function title is 'deceiving' :) It can take either a URL or a
hostname (like www.host.com) and extract the domain name from it! */
function url2Domain($url){
    $hostName=$url;
    if(strstr($url,"/")){
        // Let's find the host if this is a URL
        $parsedURL=parse_url($url);
        // No host in there? Then there's nothing to extract!
        if(!is_array($parsedURL) || empty($parsedURL['host'])) return false;
        $hostName=$parsedURL['host'];
    } // Otherwise we assume it's a hostname
    // It's gotta be lower case! It's how sane hostnames roll!
    $hostName=strtolower($hostName);
    // Let's do a sanity check on the hostname. If it's insane there's no reason to carry on!
    // Sanity check means: characters must be only . - a-z 0-9
    // One poisonous character is enough to end the insanity and skip the long PHP code that follows!
    if(preg_match("/[^a-z0-9\.\-]/i",$hostName)) return false;
    // Let's check the host before all the next code. If the parameter is wrong
    // we spare CPU time :) It has to have at least two parts like domain.tld
    $hostSlices=explode(".",$hostName);
    if(count($hostSlices)<2) return false;
    // Now we extract the hostname parts from it!
    $TLD=array_pop($hostSlices); // TLD is last
    $ccTLD=array_pop($hostSlices); // [Optional] ccTLD is 2nd last
    // Under no circumstance will the last two parts of the hostname be shorter than 2
    // There's no such thing as x.y or host.x.y
    if((strlen($TLD)<2) || (strlen($ccTLD)<2)) return false;
    //-- Get one more ... Just In Case a wicked ccTLD is present :)
    // JIC is 3rd last; the cast turns a missing part into an empty string
    $oneMoreJIC=(string)array_pop($hostSlices);
    // Let's tell the code we need the globally declared array
    global $countryLevelTLDs;
    // If it can't be found we ought to die here so we don't look bad later on!
    if(empty($countryLevelTLDs)) return false;
    // We search to see if we find the TLD and ccTLD in the array above
    if(!isset($countryLevelTLDs[$TLD]) || !strstr($countryLevelTLDs[$TLD],",$ccTLD,")){
        // If we don't find it we assume the last two pieces of the hostname are the domain
        return $ccTLD.".".$TLD;
    }
    // We found the ccTLD in the lists above. Now let's see how we handle this!
    if(!strlen($oneMoreJIC)){
        // If we fail here it means you provided something like co.uk as parameter
        // which by itself is a country level TLD hence this function will gracefully fail!
        return false;
    }
    return $oneMoreJIC.".".$ccTLD.".".$TLD;
}
?>

Testing the function!

To test the function, paste it into a blank PHP file, then paste the code below and run it!

<?php
// The demo looks much better as plain text :)
header("Content-Type: text/plain");
// Now it's the function tryouts
// Some random tests! Make your own!
var_dump(url2Domain("x.y"));
var_dump(url2Domain("host.x.y"));
var_dump(url2Domain("http://domain.co.uk/"));
var_dump(url2Domain("yahoo.co.uk"));
var_dump(url2Domain("yahoo.uk"));
var_dump(url2Domain("http://google.com/search?q=query"));
var_dump(url2Domain("http://google/search?q=query"));
// End here!
exit();
?>

And this is my (yours to be) output from the above!

bool(false)
bool(false)
string(12) "domain.co.uk"
string(11) "yahoo.co.uk"
string(8) "yahoo.uk"
string(10) "google.com"
bool(false)

Last but not least!

I've worked a lot this past week and posted no updates, but soon I'll get some cool new tutorials up. Meanwhile, enjoy this function. It's a one man show and it ain't easy :)

5 Comments Posted By Readers :

#1 Garcia from Bolivia web
Posted on Thursday, 06 December, 2007
What if the host name is "test.a.la"? or any other weird host name from http://freedns.afraid.org/domain/registry/?

Btw, that zappo bar on the right is extremely annoying.
#2 5ubliminal web
Posted on Thursday, 06 December, 2007
Good point, and maybe I'll alter it a bit. Just remove the second strlen check and it's done! But those domains are not in my area of interest.

I'm glad the Zappos bar is annoying :)
#3 Duncan Morris from Great Britain web
Posted on Tuesday, 08 April, 2008
I've been reading your blog from afar for a while now, and love reading how you go about coding stuff. I'd consider myself to be an expert in php but still learn from reading how you go about doing things..

Thanks for sharing this function, the array of domains you create is worth its weight in gold, and was exactly what I was looking for..
#4 5ubliminal web
Posted on Tuesday, 08 April, 2008
@DM: Thanks and you're welcome:)
#5 Alex from Romania web
Posted on Wednesday, 17 September, 2008
Just wanted to thank you for this script, I needed something like this on my site www.siteuri.org so I can quickly discover whether a domain is valid or not and your tool saved me a lot of time and frustration.
Thanks for sharing it!:)
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .