Sublime directory Surf the web anonymous Pagerank Monitor


do I need to use fulltext search for one column?

olaf
Thu 27 July 2006, 11:29 pm GMT +0300
Hello,

I have a table with these two important columns:
"folder" (int) and domain_name (varchar(255).

I accept there will be 1mio records some time and I need to search fast inside the column domain_name for a match like:
Code:
SELECT folder FROM table WHERE domain_name = 'www.webdigity.com';


is this a problem? or is it better to use a fulltext search?

Nikolas
Fri 28 July 2006, 02:17 pm GMT +0300
I am not sure for the full text, but if you want to make this query faster you can break the domain to 3 fields (subdomain, domain, tld) and make them index

olaf
Fri 28 July 2006, 02:22 pm GMT +0300
I am not sure for the full text, but if you want to make this query faster you can break the domain to 3 fields (subdomain, domain, tld) and make them index
This is a really good idea thanks

Nikolas
Fri 28 July 2006, 02:37 pm GMT +0300
Welcome ;)

olaf
Fri 28 July 2006, 02:40 pm GMT +0300
dou you think that this way 1-2 mio. records are not a problem?

Nikolas
Fri 28 July 2006, 02:42 pm GMT +0300
I used this type in a database of 2 mil and it was ok.

Of course for big tables using mysql 5 (instead of 4) is allways better :)

olaf
Fri 28 July 2006, 02:50 pm GMT +0300
I see another problem, there are this kind of entries possible:
subdomein.domein.nl
www.domein.nl Visit through proxy
domein.nl

above its possible to split them while using regex

but what with this kind of domain names:
maniche18.blogs.sapo.pt

I saw another URL's with more dots like
www.domein.wh.at.ev.er Visit through proxy

maybe its better to remove the www only?

olaf
Fri 28 July 2006, 02:51 pm GMT +0300
I used this type in a database of 2 mil and it was ok.

Of course for big tables using mysql 5 (instead of 4) is allways better :)
yes that's possible if this host works fine (you know which host I'm talking about)

Nikolas
Fri 28 July 2006, 02:57 pm GMT +0300
If you want to make it better, you should hold a table with all the valid tlds, and associate it with the domain.

This way you will keep the subdomain(s), the domain, and the id of the tld.
This is propably the faster, but needs some work...

olaf
Fri 28 July 2006, 03:17 pm GMT +0300
If you want to make it better, you should hold a table with all the valid tlds, and associate it with the domain.

This way you will keep the subdomain(s), the domain, and the id of the tld.
This is propably the faster, but needs some work...

yes, maybe this is the best solution... what about index the domain name without www?

Nikolas
Fri 28 July 2006, 03:21 pm GMT +0300
That would be better, but with no significant reduce in overhead.

olaf
Fri 28 July 2006, 09:51 pm GMT +0300
I am not sure for the full text, but if you want to make this query faster you can break the domain to 3 fields (subdomain, domain, tld) and make them index
Hey Nick,

I checked this but how do you parse an unkown domain name format like
domain.com
or domain.co.uk
or sub.domain.co.uk

to resolve the three parts?
checking the last part against 280 tld's?

Nikolas
Sat 29 July 2006, 02:07 am GMT +0300
First you get the tld by checkink the last characters of the domain with the tlds table,

Then you remove the tld part and split the domain with the . character, like this:

Code:
<?php
$dom 
explode '.'$domainLeft );
$topDom array_pop $dom );
$restDom implode '.'$dom );

olaf
Sat 29 July 2006, 02:13 am GMT +0300
Quote
First you get the tld by checkink the last characters of the domain with the tlds table,

somthing like:

SELECT tld FROM table WHERE 'www.domein.co.uk Visit through proxy' LIKE %tld

? maybe its too late but I don't get it... splitting all other parts is not the problem...

Nikolas
Sat 29 July 2006, 02:16 am GMT +0300
I guess that will work, but I am not really in a mood to think too :)

Check it out and let me know if it works (or not) so we can solve it tommorow :)

olaf
Sat 29 July 2006, 02:20 am GMT +0300
I guess that will work, but I am not really in a mood to think too :)

Check it out and let me know if it works (or not) so we can solve it tommorow :)
Ok I will go to bed now (to get fresh ideas the next day)

Nikolas
Sat 29 July 2006, 02:23 am GMT +0300
Good night man, see ya tomorow :) (now this thread looks more like a chat zone, lol )

olaf
Sat 29 July 2006, 12:23 pm GMT +0300
I played a little bit more with this domain split.

I think there are more poblems:

example: www.domain.co.uk Visit through proxy

there are two mathches in my tld table: co.uk and uk

another problem is the sub domain because this is optional, how does the script knows that?

(by the way the select statemant doesn't work) ;)

Nikolas
Sat 29 July 2006, 12:44 pm GMT +0300
I will try to make this with pseudo - php code :

Code:
<?php

$domain 
'whatever.com';

$sql->query "SELECT tld FROM tlds ORDER BY LENGTH DESC" );//Meaning that it will give first the co.uk and then the uk
while ( $sql->next() )
{
  if ( 
substr $domain, -1strlen$sql['tld'] ) = $sql['tld'] )
  {
     
//You found the tld
  $domainLeft substr $domain0, ( strlen($sql['tld'] - strlen($domain) + )); // It will left the whatever, without the extension and the .
  if ( strpos $domainLeft'.' ) === false //It has subdomain
  {
      $subDomain '';
      $finalDomain $domainLeft;
  }else{
     $subDomain explode '.'$domainLeft );
     $finalDomain array_pop $subDomain );
     $subDomain implode '.'$subDomain );
  }
  }
}



I am not sure if it is working, but the logic is correct (or I have to drink another coffee to think it better :) )

olaf
Sat 29 July 2006, 02:50 pm GMT +0300
Hey Nick,

Thank you that brings back the development time for this feature to only a few hours ;)

after some modification, this is the result:
Code:
<?php
$domain 
"mail.finalwebsites.co.uk";

$res mysql_query("SELECT tld FROM domain_info ORDER BY LENGTH(tld) DESC"); //Meaning that it will give first the co.uk and then the uk
while ($arr mysql_fetch_assoc($res)) {
$tmp_tld substr($domain, -strlen($arr['tld']));
if ($tmp_tld == $arr['tld']) { // You found the tld
$tld $arr['tld'];
$domainLeft substr($domain0, -(strlen($tld) + 1)); // It will left the whatever, without the extension and the .
if (strpos($domainLeft".") === false) { // It hasn't a subdomain
$subDomain "";
$finalDomain $domainLeft;
} else {
$domain_parts explode("."$domainLeft);
$finalDomain array_pop($domain_parts); // select the domain and remove it from the array
$subDomain implode("."$domain_parts); // a subdomain can more then one parts seperated with dot's
}
echo "The tld is: ".$tld."<br>";
echo "the domain name is :".$finalDomain."<br>";
echo "the subdomain is: ";
echo (!empty($subDomain)) ? $subDomain "n/a";
echo "<br>";
break;
}
}


by the way this function is a rare one (searched already the internet for a while). I like to post this script to the tutorial section, what about the tld table? Maybe we should send this data on request? (to members only)

Nikolas
Sat 29 July 2006, 02:58 pm GMT +0300
Sure you can post it :)

The tutorials area is only viewable by the members anyway ;)

olaf
Sat 29 July 2006, 11:42 pm GMT +0300
Now I have splitted the domain in the columns:
sub_domain
domain
tld

do I have to set indexes for all three columns?

Nikolas
Sun 30 July 2006, 04:15 pm GMT +0300
You need a primary key for all of those (check them all together and add primary key from phpMyAdmin)

olaf
Sun 30 July 2006, 04:20 pm GMT +0300
got an error because the entries are not unique sinds I have records for different image sizes for the same domain name... do I have to change that?

Nikolas
Sun 30 July 2006, 04:27 pm GMT +0300
I don't got that. You keep for each domain one photo?

If you do then all 4 fields must be kept as a primary key.

olaf
Sun 30 July 2006, 04:31 pm GMT +0300
I don't got that. You keep for each domain one photo?

If you do then all 4 fields must be kept as a primary key.
yes I save max 4 thumbs for each domain (Chris is using two formats)

Now I understand how this works with the primary index, thanks (these 4 columns are the values I use to select a thumbnail)

Nikolas
Sun 30 July 2006, 04:33 pm GMT +0300
Quote
these 4 columns are the values I use to select a thumbnail

That's the point. You got it :)

Archive for SMF v1.00 by N.P. Valid XHTML 1.0 Transitional