Someone said it's already in PHP, but that's slightly wrong. parse_url will, indeed, give you the hostname! But there's a huge difference between a hostname and a domain name.
For http://www.asite.co.uk/ parse_url will return www.asite.co.uk as the hostname, but the domain name is asite.co.uk. The list you'll find here is fairly up to date, including CentralNic's uk.com and similar TLDs and 2nd level TLDs. Read what these are below!
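To make the hostname vs. domain name difference concrete, here is a minimal sketch. It only uses parse_url, explode, array_slice and implode; the variable names ($parts, $host, $naive) are mine and purely illustrative. It shows that parse_url stops at the hostname, and that simply keeping the last two labels is not enough for 2nd level TLDs.

```php
<?php
// parse_url() splits a URL into components; it knows nothing about
// which part of the host is the registrable domain.
$parts = parse_url('http://www.asite.co.uk/');
$host  = $parts['host'];
echo $host, "\n";   // www.asite.co.uk

// A naive "keep the last two labels" rule breaks on 2nd level TLDs:
$labels = explode('.', $host);
$naive  = implode('.', array_slice($labels, -2));
echo $naive, "\n";  // co.uk  (a 2nd level TLD, not a domain name)
```

That second result is exactly why a list of 2nd level TLDs is needed.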
Over a month ago I published the Domain From URL in C++. The same theory applies here, but this code is in PHP. I think some of you may find good use for it! The code extracts the domain name from a hostname or URL; the result contains the domain, the TLD and, where present, a country level TLD (2nd level TLD).
As some of you will notice, I took a different approach in the PHP version compared to the C++ one. I don't use one big chunk of text; instead I split the list by Top Level Domain (TLD). This keeps it fast, because the string matching is done on very small slices. I enclosed each list between , (commas) and use , (comma) as the separator. This way I won't miss the first or last entry, I won't find partial matches, and a single strstr call does the work! Any other way would have needed more than one function call.
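Here is a small sketch of why the wrapping commas matter. The helper name hasEntry and the shortened three-entry list are assumptions made for illustration only; the real lists are in the code that follows.

```php
<?php
// Hypothetical helper: true when $label is a whole entry of a
// comma-wrapped list such as ',co,me,org,'.
function hasEntry($list, $label) {
    return strstr($list, ',' . $label . ',') !== false;
}

// Shortened, illustrative list in the same format as the real ones.
$list = ',co,me,org,';

// Without the commas, "or" is found inside "org": a partial match.
var_dump(strstr('co,me,org', 'or') !== false); // bool(true)

// With the commas, only whole entries match, first and last included.
var_dump(hasEntry($list, 'or'));   // bool(false)
var_dump(hasEntry($list, 'co'));   // bool(true)
var_dump(hasEntry($list, 'org'));  // bool(true)
```

The leading and trailing commas are what let the first and last entries match with the same single call.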
<?php
/* I decided to instantiate this array outside the function. I'm not sure how PHP works
but this has to be faster than initializing the array every time it is used! */
$countryLevelTLDs=array(); // The MEGA array of country level TLDs ... edit as you need!
/*
I put this together in a peculiar way. You will see it starts, ends and separates with , (comma)
Why did I do this? I wanted to reduce execution fat as much as possible.
If you edit the array, always make sure each entry is separated by , and the string starts and ends with ,
This is to ensure the function won't find partial matches or miss the first or last term!
I also wanted to reduce all this to a quick string match with no RegExp or complex code!
-- It needs to be fast :)
*/
$countryLevelTLDs['ac']=',com,edu,gov,net,mil,org,';
$countryLevelTLDs['ae']=',com,net,org,gov,ac,co,sch,pro,';
$countryLevelTLDs['ai']=',com,org,edu,gov,';
$countryLevelTLDs['ar']=',com,net,org,gov,mil,edu,int,';
$countryLevelTLDs['at']=',co,ac,or,gv,priv,';
$countryLevelTLDs['au']=',com,gov,org,edu,id,oz,info,net,asn,csiro,telemem';
$countryLevelTLDs['au'].='o,conf,otc,id,';
$countryLevelTLDs['az']=',com,net,org,';
$countryLevelTLDs['bb']=',com,net,org,';
$countryLevelTLDs['be']=',ac,belgie,dns,fgov,';
$countryLevelTLDs['bh']=',com,gov,net,edu,org,';
$countryLevelTLDs['bm']=',com,edu,gov,org,net,';
$countryLevelTLDs['br']=',adm,adv,agr,am,arq,art,ato,bio,bmd,cim,cng,cnt,c';
$countryLevelTLDs['br'].='om,coop,ecn,edu,eng,esp,etc,eti,far,fm,fnd,fot,fs';
$countryLevelTLDs['br'].='t,ggf,gov,imb,ind,inf,jor,lel,mat,med,mil,mus,net';
$countryLevelTLDs['br'].=',nom,not,ntr,odo,org,ppg,pro,psc,psi,qsl,rec,slg,';
$countryLevelTLDs['br'].='srv,tmp,trd,tur,tv,vet,zlg,';
$countryLevelTLDs['bs']=',com,net,org,';
$countryLevelTLDs['ca']=',ab,bc,mb,nb,nf,nl,ns,nt,nu,on,pe,qc,sk,yk,gc,';
$countryLevelTLDs['ck']=',co,net,org,edu,gov,';
$countryLevelTLDs['cn']=',com,edu,gov,net,org,ac,ah,bj,cq,gd,gs,gx,gz,hb,h';
$countryLevelTLDs['cn'].='e,hi,hk,hl,hn,jl,js,ln,mo,nm,nx,qh,sc,sn,sh,sx,tj';
$countryLevelTLDs['cn'].=',tw,xj,xz,yn,zj,';
$countryLevelTLDs['co']=',arts,com,edu,firm,gov,info,int,nom,mil,org,rec,s';
$countryLevelTLDs['co'].='tore,web,';
$countryLevelTLDs['cr']=',ac,co,ed,fi,go,or,sa,';
$countryLevelTLDs['cu']=',com,net,org,';
$countryLevelTLDs['cy']=',ac,com,gov,net,org,';
$countryLevelTLDs['dk']=',co,';
$countryLevelTLDs['do']=',art,com,edu,gov,gob,org,mil,net,sld,web,';
$countryLevelTLDs['dz']=',com,org,net,gov,edu,ass,pol,art,';
$countryLevelTLDs['ec']=',com,edu,fin,med,gov,mil,org,net,';
$countryLevelTLDs['ee']=',com,pri,fie,org,med,';
$countryLevelTLDs['eg']=',com,edu,eun,gov,net,org,sci,';
$countryLevelTLDs['er']=',com,net,org,edu,mil,gov,ind,';
$countryLevelTLDs['es']=',com,org,gob,edu,nom,';
$countryLevelTLDs['et']=',com,gov,org,edu,net,biz,name,info,';
$countryLevelTLDs['fj']=',ac,com,gov,id,org,school,';
$countryLevelTLDs['fk']=',com,ac,gov,net,nom,org,';
$countryLevelTLDs['fr']=',asso,nom,barreau,com,prd,presse,tm,aeroport,asse';
$countryLevelTLDs['fr'].='dic,avocat,avoues,cci,chambagri,gouv,greta,medeci';
$countryLevelTLDs['fr'].='n,notaires,pharmacien,port,veterinaire,';
$countryLevelTLDs['ge']=',com,edu,gov,mil,net,org,pvt,';
$countryLevelTLDs['gg']=',co,org,sch,ac,gov,ltd,ind,net,alderney,guernsey,';
$countryLevelTLDs['gg'].='sark,';
$countryLevelTLDs['gr']=',com,edu,gov,net,org,';
$countryLevelTLDs['gt']=',com,edu,net,gob,org,mil,ind,';
$countryLevelTLDs['gu']=',com,edu,net,org,gov,mil,';
$countryLevelTLDs['hk']=',com,net,org,idv,gov,edu,';
$countryLevelTLDs['hu']=',co,erotika,jogasz,sex,video,info,agrar,film,kony';
$countryLevelTLDs['hu'].='velo,shop,org,bolt,forum,lakas,suli,priv,casino,g';
$countryLevelTLDs['hu'].='ames,media,szex,sport,city,hotel,news,tozsde,tm,e';
$countryLevelTLDs['hu'].='rotica,ingatlan,reklam,utazas,';
$countryLevelTLDs['id']=',ac,co,go,mil,net,or,';
$countryLevelTLDs['il']=',co,net,org,ac,gov,muni,idf,';
$countryLevelTLDs['im']=',co,net,org,ac,gov,nic,';
$countryLevelTLDs['in']=',co,net,ac,ernet,gov,nic,res,gen,firm,mil,org,ind,';
$countryLevelTLDs['ir']=',ac,co,gov,id,net,org,sch,';
$countryLevelTLDs['je']=',ac,co,net,org,gov,ind,jersey,ltd,sch,';
$countryLevelTLDs['jo']=',com,org,net,gov,edu,mil,';
$countryLevelTLDs['jp']=',ad,ac,co,go,or,ne,gr,ed,lg,net,org,gov,hokkaido,';
$countryLevelTLDs['jp'].='aomori,iwate,miyagi,akita,yamagata,fukushima,ibar';
$countryLevelTLDs['jp'].='aki,tochigi,gunma,saitama,chiba,tokyo,kanagawa,ni';
$countryLevelTLDs['jp'].='igata,toyama,ishikawa,fukui,yamanashi,nagano,gifu';
$countryLevelTLDs['jp'].=',shizuoka,aichi,mie,shiga,kyoto,osaka,hyogo,nara,';
$countryLevelTLDs['jp'].='wakayama,tottori,shimane,okayama,hiroshima,yamagu';
$countryLevelTLDs['jp'].='chi,tokushima,kagawa,ehime,kochi,fukuoka,saga,nag';
$countryLevelTLDs['jp'].='asaki,kumamoto,oita,miyazaki,kagoshima,okinawa,sa';
$countryLevelTLDs['jp'].='pporo,sendai,yokohama,kawasaki,nagoya,kobe,kitaky';
$countryLevelTLDs['jp'].='ushu,utsunomiya,kanazawa,takamatsu,matsuyama,';
$countryLevelTLDs['kh']=',com,net,org,per,edu,gov,mil,';
$countryLevelTLDs['kr']=',ac,co,go,ne,or,pe,re,seoul,kyonggi,';
$countryLevelTLDs['kw']=',com,net,org,edu,gov,';
$countryLevelTLDs['la']=',com,net,org,';
$countryLevelTLDs['lb']=',com,org,net,edu,gov,mil,';
$countryLevelTLDs['lc']=',com,edu,gov,net,org,';
$countryLevelTLDs['lv']=',com,net,org,edu,gov,mil,id,asn,conf,';
$countryLevelTLDs['ly']=',com,net,org,';
$countryLevelTLDs['ma']=',co,net,org,press,ac,';
$countryLevelTLDs['mk']=',com,';
$countryLevelTLDs['mm']=',com,net,org,edu,gov,';
$countryLevelTLDs['mn']=',com,org,edu,gov,museum,';
$countryLevelTLDs['mo']=',com,net,org,edu,gov,';
$countryLevelTLDs['mt']=',com,net,org,edu,tm,uu,';
$countryLevelTLDs['mx']=',com,net,org,gob,edu,';
$countryLevelTLDs['my']=',com,org,gov,edu,net,';
$countryLevelTLDs['na']=',com,org,net,alt,edu,cul,unam,telecom,';
$countryLevelTLDs['nc']=',com,net,org,';
$countryLevelTLDs['ng']=',ac,edu,sch,com,gov,org,net,';
$countryLevelTLDs['ni']=',gob,com,net,edu,nom,org,';
$countryLevelTLDs['np']=',com,net,org,gov,edu,';
$countryLevelTLDs['nz']=',ac,co,cri,gen,geek,govt,iwi,maori,mil,net,org,sc';
$countryLevelTLDs['nz'].='hool,';
$countryLevelTLDs['om']=',com,co,edu,ac,gov,net,org,mod,museum,biz,pro,med,';
$countryLevelTLDs['pa']=',com,net,org,edu,ac,gob,sld,';
$countryLevelTLDs['pe']=',edu,gob,nom,mil,org,com,net,';
$countryLevelTLDs['pg']=',com,net,ac,';
$countryLevelTLDs['ph']=',com,net,org,mil,ngo,';
$countryLevelTLDs['pl']=',aid,agro,atm,auto,biz,com,edu,gmina,gsm,info,mai';
$countryLevelTLDs['pl'].='l,miasta,media,mil,net,nieruchomosci,nom,org,pc,p';
$countryLevelTLDs['pl'].='owiat,priv,realestate,rel,sex,shop,sklep,sos,szko';
$countryLevelTLDs['pl'].='la,targi,tm,tourism,travel,turystyka,';
$countryLevelTLDs['pk']=',com,net,edu,org,fam,biz,web,gov,gob,gok,gon,gop,';
$countryLevelTLDs['pk'].='gos,';
$countryLevelTLDs['ps']=',edu,gov,plo,sec,';
$countryLevelTLDs['pt']=',com,edu,gov,int,net,nome,org,publ,';
$countryLevelTLDs['py']=',com,net,org,edu,';
$countryLevelTLDs['qa']=',com,net,org,edu,gov,';
$countryLevelTLDs['re']=',asso,com,nom,';
$countryLevelTLDs['ro']=',com,org,tm,nt,nom,info,rec,arts,firm,store,www,';
$countryLevelTLDs['ru']=',com,net,org,gov,pp,';
$countryLevelTLDs['sa']=',com,edu,sch,med,gov,net,org,pub,';
$countryLevelTLDs['sb']=',com,net,org,edu,gov,';
$countryLevelTLDs['sd']=',com,net,org,edu,sch,med,gov,';
$countryLevelTLDs['se']=',tm,press,parti,brand,fh,fhsk,fhv,komforb,kommuna';
$countryLevelTLDs['se'].='lforbund,komvux,lanarb,lanbib,naturbruksgymn,sshn';
$countryLevelTLDs['se'].=',org,pp,';
$countryLevelTLDs['sg']=',com,net,org,edu,gov,per,';
$countryLevelTLDs['sh']=',com,net,org,edu,gov,mil,';
$countryLevelTLDs['st']=',gov,saotome,principe,consulado,embaixada,org,edu';
$countryLevelTLDs['st'].=',net,com,store,mil,co,';
$countryLevelTLDs['sv']=',com,org,edu,gob,red,';
$countryLevelTLDs['sy']=',com,net,org,gov,';
$countryLevelTLDs['th']=',ac,co,go,net,or,';
$countryLevelTLDs['tn']=',com,net,org,edunet,gov,ens,fin,nat,ind,info,intl';
$countryLevelTLDs['tn'].=',rnrt,rnu,rns,tourism,';
$countryLevelTLDs['tr']=',com,net,org,edu,gov,mil,bbs,gen,';
$countryLevelTLDs['tt']=',co,com,org,net,biz,info,pro,int,coop,jobs,mobi,t';
$countryLevelTLDs['tt'].='ravel,museum,aero,name,gov,edu,nic,us,uk,ca,eu,es';
$countryLevelTLDs['tt'].=',fr,it,se,dk,be,de,at,au,';
$countryLevelTLDs['tv']=',co,';
$countryLevelTLDs['tw']=',com,net,org,edu,idv,gov,';
$countryLevelTLDs['ua']=',com,net,org,edu,gov,';
$countryLevelTLDs['ug']=',ac,co,or,go,';
$countryLevelTLDs['uk']=',co,me,org,edu,ltd,plc,net,sch,nic,ac,gov,nhs,pol';
$countryLevelTLDs['uk'].='ice,mod,';
$countryLevelTLDs['us']=',dni,fed,';
$countryLevelTLDs['uy']=',com,edu,net,org,gub,mil,';
$countryLevelTLDs['ve']=',com,net,org,co,edu,gov,mil,arts,bib,firm,info,in';
$countryLevelTLDs['ve'].='t,nom,rec,store,tec,web,';
$countryLevelTLDs['vi']=',co,net,org,';
$countryLevelTLDs['vn']=',com,biz,edu,gov,net,org,int,ac,pro,info,health,n';
$countryLevelTLDs['vn'].='ame,';
$countryLevelTLDs['vu']=',com,edu,net,org,de,ch,fr,';
$countryLevelTLDs['ws']=',com,net,org,gov,edu,';
$countryLevelTLDs['yu']=',ac,co,edu,org,';
$countryLevelTLDs['ye']=',com,net,org,gov,edu,mil,';
$countryLevelTLDs['za']=',ac,alt,bourse,city,co,edu,gov,law,mil,net,ngo,no';
$countryLevelTLDs['za'].='m,org,school,tm,web,';
$countryLevelTLDs['zw']=',co,ac,org,gov,';
$countryLevelTLDs['org']=',eu,dk,';
$countryLevelTLDs['com']=',au,br,cn,de,eu,gb,hu,no,qc,ru,sa,se,uk,us,uy,za,';
$countryLevelTLDs['net']=',de,gb,uk,';
$countryLevelTLDs['no']=',tel,';
$countryLevelTLDs['nr']=',fax,mob,mobil,mobile,tel,tlf,';
/* The function title is 'deceiving' :) It can take either a URL or
a hostname (like www.host.com) and extract the domain name from it! */
function url2Domain($url){
$hostName=$url;
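if(strstr($url,"/")){
// Let's find the host if this is a URL
$parsedURL=parse_url($url);
// parse_url can fail or return no host (for a bare path, for example)
if(!is_array($parsedURL) || !isset($parsedURL['host'])) return false;
$hostName=$parsedURL['host'];
}
// Otherwise we assume it's a hostname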
// It's gotta be lower case! It's how sane hostNames roll!
$hostName = strtolower($hostName);
// Let's do a sanity check on the hostName. If it's insane there's no reason to carry on!
// Sanity check means: characters must be only . - a-z 0-9
// One poisonous character has to end the insanity and long PHP code that follows!
if(preg_match("/[^a-z0-9\.\-]/i",$hostName)) return false;
// Let's check the host before all the next code. If the parameter is wrong
// we spare CPU time :) It has to have at least two parts like domain.tld
$hostSlices = explode(".",$hostName);
if(count($hostSlices)<2) return false;
//Now we extract the hostname parts from it!
$TLD = array_pop($hostSlices); // TLD is last
$ccTLD = array_pop($hostSlices); // [Optional] ccTLD is 2nd last
// Under no circumstance will the last two parts of the hostname be shorter than 2
// There's no such thing as x.y or host.x.y
if((strlen($TLD)<2) || (strlen($ccTLD)<2)) return false;
//-- Get one more ... Just In Case a wicked ccTLD is present :)
$oneMoreJIC = array_pop($hostSlices); // JIC is 3rd last (NULL if there is none)
// Let's tell the code we need the global declared array
global $countryLevelTLDs;
// If it can't be found we ought to die here so we don't look bad later on!
if(!count($countryLevelTLDs)) return false;
// We search to see if we find the TLD and ccTLD in the array above
if(!isset($countryLevelTLDs[$TLD]) || !strstr($countryLevelTLDs[$TLD],",$ccTLD,")){
// If we don't find it we assume the last two pieces of the hostname are the domain
return $ccTLD.".".$TLD;
}
// We found the ccTLD in the lists above. Now let's see how we handle this!
if($oneMoreJIC===null || $oneMoreJIC===''){
// If we fail here it means you provided something like co.uk as parameter
// which by itself is a country level TLD hence this function will gracefully fail!
// (array_pop gives NULL on an empty array, so we test for it explicitly)
return false;
}
return $oneMoreJIC.".".$ccTLD.".".$TLD;
}
?>
To test the function, paste it in a blank PHP file, then paste the code below and run it!
<?php
// The demo looks much better as plain text :)
header("Content-Type: text/plain");
// Now it's the function tryouts
// Some random tests! Make your own!
var_dump(url2Domain("x.y"));
var_dump(url2Domain("host.x.y"));
var_dump(url2Domain("http://domain.co.uk/"));
var_dump(url2Domain("yahoo.co.uk"));
var_dump(url2Domain("yahoo.uk"));
var_dump(url2Domain("http://google.com/search?q=query"));
var_dump(url2Domain("http://google/search?q=query"));
// End here!
exit();
?>
And this is my (yours to be) output from the above!
bool(false)
bool(false)
string(12) "domain.co.uk"
string(11) "yahoo.co.uk"
string(8) "yahoo.uk"
string(10) "google.com"
bool(false)
I've worked a lot this last week and posted no updates. I'll get some cool new tutorials up soon; meanwhile, enjoy this function. It's a one man show and it ain't easy :)