
On the live server there were 4 pages also.

Could you create a table like this:

You just won't give up, will you ;-)

I will not be able to test this on the live server (as I don't want to take it down for more than a few seconds), so I'll just run the test locally, and let you know how it went.

edit: Results:

mean: 1.130008698 ms
stddev: 0.105461126 ms

So not really an improvement. On my local database this query itself takes about 1.1 ms (as reported by the DB), so this is really where the overhead is.

On the live server there were 4 pages also.

Thanks, which means that text-based calculations are worse than I thought :(.

You just won't give up, will you ;-)

I need to be 100% sure before I give up ;).

I will not be able to test this on the live server (as I don't want to take it down for more than a few seconds), so I'll just run the test locally, and let you know how it went.

OK, thank you. I'm interested in how much difference it will make. If those pre-calculated columns were in sym_pages, it would be even better. Also, the comparison to 'index' could be pre-calculated (all at page-save time, so it should work OK, unless someone keeps modifying pages directly through SQL)... but let's see if changing the worst parts makes any meaningful difference.

mean: 1.130008698 ms
stddev: 0.105461126 ms

Almost no difference at all (or even slower)!? That's crazy - a simple JOIN shouldn't be slower than all the text cutting and gluing for each and every row :(.

The text-based operations are not really the problem: the joins, the LIKE matching, and the sorts are. The current query uses three tables, and on two of those no index is used.

Honestly, I think you can improve on this a bit by writing a better query, but it will not be easier to understand, and I doubt if you will be able to beat two straightforward queries, even with the database at a remote location.

That's crazy - a simple JOIN shouldn't be slower than all the text cutting and gluing for each and every row :(.

Yes, it surprised me, too. But, we are on the <1ms level here. And the join creates a temporary table, which adds quite a bit of overhead.

The text-based operations are not really the problem: the joins, the LIKE matching, and the sorts are. The current query uses three tables, and on two of those no index is used.

Yes, but joins could be eliminated. Everything could be in a single table.

Honestly, I think you can improve on this a bit by writing a better query,

Agreed.

but it will not be easier to understand, and I doubt if you will be able to beat two straightforward queries, even with the database at a remote location.

"Two queries" is just the case where the URL path = page name. With every "node" added to the URL path, an additional query is run (Symphony calls PageManager::resolvePageByPath for each node, e.g., "test/one/two/three", "test/one/two", "test/one", "test"), and that is what I am trying to work around. I often have pages with 3 or 4 parameters. Sometimes even more. Let's say there is a page "test" which takes 3 parameters. That's 4 queries, so twice as many as for a page without parameters.
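The lookup sequence described above can be sketched like this (a minimal sketch in Python rather than Symphony's PHP; `candidate_paths` is a hypothetical helper, not part of Symphony):

```python
def candidate_paths(url_path):
    """Build the list of paths Symphony would try, most specific first.

    Each entry corresponds to one PageManager::resolvePageByPath call,
    so a URL with n nodes costs n queries in the stock implementation.
    """
    nodes = url_path.strip("/").split("/")
    return ["/".join(nodes[:i]) for i in range(len(nodes), 0, -1)]

# candidate_paths("test/one/two/three") gives:
#   ["test/one/two/three", "test/one/two", "test/one", "test"]
```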

Yes, it surprised me, too. But, we are on the <1ms level here. And the join creates a temporary table, which adds quite a bit of overhead.

You're right. Keeping everything in a single table should work better. Adding an index on sortorder and the other columns used to order results should help a lot too.

I often have pages with 3 or 4 parameters. Sometimes even more. Let's say there is a page "test" which takes 3 parameters. That's 4 queries, so twice as many as for a page without parameters.

OK, I do understand you don't like many queries because the DB is on a separate server, but what if I told you that Symphony keeps the connection open, and the latency with a quality host is so low that you will pretty much never be able to beat these 3 or 4 queries with one big query?

And even if you were able to get a query that runs faster than the 3 or 4 queries do now, what have you gained? At most 1 or 2 ms!

Please, please, please, if you are really looking for performance updates, rework your datasources, optimize the shit out of your templates, but look for the places where performance stinks!

and the latency with a quality host is so low that you will pretty much never be able to beat these 3 or 4 queries with one big query?

Even if that is true, and I will get only equal results at best, I will learn a thing or two in the process :).

And even if you were able to get a query that runs faster than the 3 or 4 queries do now, what have you gained? At most 1 or 2 ms!

It can add up when it happens for almost every page (most of my pages use parameters).

Please, please, please, if you are really looking for performance updates, rework your datasources, optimize the shit out of your templates, but look for the places where performance stinks!

Haha, that too, of course :).

IPR and the Filter field (now called Conditionalizer) were created for two reasons back then: to omit "home" in the URL and to minimize the number of database queries (because they were slowing the whole thing down a lot). Well... maybe three, because I did not want to create hundreds of pages and data-sources (I prefer parameters).

How about a single table test :)?

CREATE TABLE `sym_pages_resolve` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `parent` int(11) default NULL,
  `title` varchar(255) collate utf8_unicode_ci NOT NULL default '',
  `handle` varchar(255) collate utf8_unicode_ci default NULL,
  `path` varchar(255) collate utf8_unicode_ci default NULL,
  `params` varchar(255) collate utf8_unicode_ci default NULL,
  `data_sources` text collate utf8_unicode_ci,
  `events` text collate utf8_unicode_ci,
  `sortorder` int(11) NOT NULL default '0',
  `is_index` tinyint(1) NOT NULL default '0',
  `params_count` int(11) NOT NULL default '0',
  `fullpath` varchar(255) collate utf8_unicode_ci NOT NULL default '',
  PRIMARY KEY (`id`),
  KEY `parent` (`parent`),
  KEY `sorter` (`is_index`,`params_count`,`sortorder`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

INSERT INTO `sym_pages_resolve`
SELECT p.id, p.parent, p.title, p.handle, p.path, p.params, p.data_sources, p.events, p.sortorder,
       (pt.id > 0),
       (LENGTH(p.params) - LENGTH(REPLACE(COALESCE(p.params, ''), '/', '')) + 1),
       CONCAT_WS('/', p.path, p.handle, '%')
FROM `sym_pages` p
LEFT JOIN `sym_pages_types` pt ON pt.page_id = p.id AND pt.type = 'index'
GROUP BY p.id;
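The params_count expression in that INSERT is just a slash count. The same arithmetic, sketched in Python for brevity (note that the SQL version yields NULL when params is NULL, since the COALESCE only covers the REPLACE argument; the 0 here is my own choice):

```python
def params_count(params):
    # Mirrors LENGTH(params) - LENGTH(REPLACE(params, '/', '')) + 1:
    # removing every '/' and comparing lengths counts the separators,
    # and adding 1 turns that into the number of '/'-delimited entries.
    if params is None:
        return 0  # hypothetical choice; the SQL above yields NULL here
    return len(params) - len(params.replace("/", "")) + 1
```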

And select:

SELECT p.* FROM sym_pages_resolve p WHERE '%s' LIKE p.fullpath OR (p.is_index = 1 AND p.params_count >= %d) ORDER BY p.is_index ASC, p.params_count DESC, p.sortorder DESC LIMIT 1

Still using filesort ("Using where; Using filesort"), but the query is a lot simpler now.

It can add up when it happens for almost every page (most of my pages use parameters).

By "at most" I meant: even if you manage to create something that takes zero time. So no, it will not really add up. Just like the 0.001ms on the RPP doesn't add up. It's like trying to dust your house with pliers. Of course you will be able to pick up pieces of dust, and yes, these pieces of dust will add up if you collect them all in a bucket. But will your girlfriend/wife notice the house getting cleaner? Not really.

Even if that is true, and I will get only equal results at best, I will learn a thing or two in the process :).

Fair enough.

I think you should drop the LIKE, and replace it with something like:

where path = '' or path = 'home' or path = 'home/page' or path = 'home/page/param' etc. Splitting up the string is very easy (and fast) in PHP using ex/implode. This will make the query quite a bit faster. Then, you can always sort by the string length of the matched nodes: the longest path should come out on top.

Now, because by itself this will return false positives, combine that with the parameter count join.
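A sketch of how such a query could be assembled (Python for brevity; the table and column names follow the sym_pages_resolve example above, and real code should of course bind the values rather than interpolate them):

```python
def build_resolve_query(url_path):
    """Equality matches instead of LIKE: one placeholder per candidate path.

    The longest (most specific) matching path wins, per the suggestion
    to sort by the string length of the matched nodes.
    """
    nodes = url_path.strip("/").split("/")
    candidates = ["/".join(nodes[:i]) for i in range(len(nodes), 0, -1)]
    placeholders = ", ".join(["%s"] * len(candidates))
    sql = ("SELECT * FROM sym_pages_resolve WHERE path IN (" + placeholders +
           ") ORDER BY LENGTH(path) DESC LIMIT 1")
    return sql, candidates
```

For "test/one" this produces a two-placeholder query and the parameter list ["test/one", "test"].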

Or, alternatively, replace this one big query with a few separate queries, and see what happens! :)

edit: regarding your 1 table solution: still around 1ms.

But will your girlfriend/wife notice the house getting cleaner? Not really.

That depends on how insane about cleanliness she can be ;).

I think you should drop the LIKE, and replace it with something like: [...]

I already tried it with something like:

fullpath IN ('test/one/two/three', 'test/one/two', 'test/one', 'test')

and it did not change much (both before, and now with the single-table test). Splitting it into OR-ed comparisons did not make any difference (it probably evaluates to the same thing internally).

Then, you can always sort by the string length of the matched nodes: the longest path should come out on top.

Yes, or have the node count pre-calculated in the table - sorting by an indexed integer column should be a bit faster.

Or, alternatively, replace this one big query with a few separate queries, and see what happens! :)

You mean 5 queries :)?

edit: regarding your 1 table solution: still around 1ms.

Compared to multiple queries for a page with 3 parameters?

I've just modified IPR to include parts of your RPP. Wrapped each of the blocks with microtime() calls, added some var_dump() and...

Best scenario for RPP:

string(23) "RPP: 0.000592947006226s"
string(21) "IPR: 0.0148978233337s"
string(24) "IPR2: 0.000533819198608s"

IPR2 is the single-table solution. The thing is, such results showed up only on the first try. Every subsequent refresh was more like:

string(23) "RPP: 0.000527858734131s"
string(21) "IPR: 0.0001540184021s"
string(24) "IPR2: 0.000133991241455s"

Then I swapped IPR and IPR2, just in case the cache was messing up the timing:

string(23) "RPP: 0.000648021697998s"
string(24) "IPR2: 0.000415086746216s"
string(21) "IPR: 0.0146610736847s"

And then again IPR was getting back on track with the next refresh (so I am guessing that the cache matters more for all that text-manipulation voodoo than for simple queries) - closer to IPR2's timing.

Actually, the best scenario for RPP was indeed closer to your timing, but only on the first try, before I noticed that I had not copied and pasted the __getIndexPage() and _get_fallback() functions from RPP's code (which means that RPP did not run its query to select the index page).

So, I think I'm staying with IPR - I'll probably just implement a pre-calculation feature, which will minimize that "first page load" slowdown of the current implementation :).

You guys are nuts. Brilliant, but nuts.

Ah yes, pre-calculating everything in a separate table will speed things up. Nice to see those results.

I can tell you where the speed difference between reloads comes from: by default Symphony issues its queries with SELECT SQL_CACHE, which is very fast if you query the same thing again after a short while.

I noticed the same thing with datasources and SBL fields: when driving a busy site, rewriting those queries from multiple into one saved a lot of time, because the results could be fetched from the cache (load times dropped from 500ms to 10ms). Unfortunately none of my websites are visited heavily enough to really benefit from this (the cache is stale before the next visitor arrives).

The thing I like about these results is that they show that optimized tables with pre-calculated values and proper indexes are still roughly as fast as multiple simple selects from a standard table, as long as the dataset is small enough. In my eyes this is a big victory for simplicity: the actual logic in the RPP extension is only 36 lines long. You are going to need much more code than that if you want to keep this new table up to date :-)

By the way, can you put your code on github or something? I'd love to give it a whirl and see what the results are on a local installation!

Oh, I forgot to mention this: the RPP is also benefiting from caching, but because the query is by itself already a straightforward lookup in the cache (that's what indexes do, after all :)), the results are less noticeable.

The biggest difference is in the overhead of the multiple queries. If the RPP were doing hundreds of queries, you would see the numbers go up accordingly, but then caching doesn't do much (in my testing with the datasources, the results varied from 550 to 500, but not much less).

However, the single query will now be acting as a simple query (because of the cache), and the rest of the overhead is nonexistent.

For the record: are your results in ms or s?

This is the last thing before I go to bed: my microtime seems to return the same values over and over again, so apparently I am hitting the accuracy limit of the function:

  • 0.0019073486328125
  • 0.0030994415283203
  • 0.0019073486328125
  • 0.0030994415283203
  • 0.0030994415283203
  • 0.0028610229492188
  • 0.0030994415283203
  • 0.0019073486328125
  • 0.0030994415283203
  • 0.0028610229492188

Long story short: if PHP cannot track the time it takes to execute the function because of its internal accuracy, my optimizing job is done. Good night!

IPR vs RPP! Who's gonna win? :-) Those milliseconds seem a bit irrelevant to me, but it's quite interesting!

by default Symphony issues its queries with SELECT SQL_CACHE, which is very fast if you query the same thing again after a short while.

Yes, that is why I switched the places of IPR and IPR2, to see the difference - just in case the cache re-uses some data (it didn't for different queries, which is good ;).

Unfortunately none of my websites are visited heavily enough to really benefit from this (the cache is stale before the next visitor arrives).

That may be true now, but you never know when they might become site-of-the-day, because someone popular mentioned them on a blog/social network/forum :).

In my eyes this is a big victory for simplicity: the actual logic in the RPP extension is only 36 lines long.

But you did not count the loops and the text cutting and gluing done by Symphony, which are avoided when using IPR.

"Pre-calculation" may sound complicated in this case, but it is not at all. Symphony keeps page paths and page handles separate (probably a leftover from the old days, because the full path is used more often than the handle). That's OK, they may be useful that way from time to time. But it could keep them as a full path too. The number of parameters, kept as an integer, can be useful too. The only thing I do not like that much is keeping "is_index", because sooner or later someone will come and say "is_404" would be useful too, and then "is_my_type" too... :). Anyway, if Symphony stored that information in sym_pages, it could use a single query to resolve a page. I think that would actually simplify the whole process a lot.

Now, even if Symphony does not change its ways and IPR keeps its place, the process of resolving pages will not become more complicated. The only complication will be one added delegate that saves the additional data to the database whenever a page is saved (and it will not have to use that voodoo SQL from the example above, because it can calculate the values in PHP). Sure, it's not ideal, but it's not really that bad if you can speed up a popular website (where page views happen a lot more often than page writes :).
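What such a save-delegate would pre-calculate is small. A sketch (Python for brevity; `precalc_fields` is a hypothetical helper that mirrors the extra columns in the sym_pages_resolve example above):

```python
def precalc_fields(path, handle, params, is_index):
    # fullpath mirrors CONCAT_WS('/', path, handle, '%'):
    # join the non-NULL parts with '/', ending in the LIKE wildcard.
    fullpath = "/".join(part for part in (path, handle, "%") if part is not None)
    # params_count: number of '/'-separated parameter names, 0 when unset.
    params_count = 0 if not params else params.count("/") + 1
    return {"fullpath": fullpath,
            "params_count": params_count,
            "is_index": int(is_index)}
```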

By the way, can you put your code on github or something? I'd love to give it a whirl and see what the results are on a local installation!

Here is the IPR extension file with added var_dump()s and the code from RPP:

https://gist.github.com/3022618

This is the first run I did today, after no one had used the server for a few hours:

string(20) "RPP: 0.187596797943s"
string(20) "IPR: 0.109179973602s"
string(21) "IPR2: 0.067351102829s"

Without any cache ready to be used, even Symphony+RPP's "multiple simple queries" can be slow :).

Some tests, where I was recreating the sym_pages_resolve table at every page view, showed IPR2 still being faster (even though RPP did use the cache, and IPR2 did not).

However, the single query will now be acting as a simple query (because of the cache), and the rest of the overhead is nonexistent.

The thing is, it's not a single query in the case of Symphony+RPP: RPP's query for the index page is not that simple (a JOIN may be expensive, as we've seen above in the case of IPR :), and the IPR2 query is almost as simple as the one in Symphony.

For the record: are your results in ms or s?

Seconds.

PHP cannot track the time it takes to execute the function because of its internal accuracy

Sure, it's not scientific timing, but one can see that the results are more or less consistent. RPP takes at least 0.0004s more than IPR and IPR2 when the cache is used, and comes second (after IPR2, before IPR) when the cache is not used. And the difference will go up with every additional parameter.

Even without a cache and without parameters (which would be the simplest page) I get something like this:

string(23) "RPP: 0.000620126724243s"
string(21) "IPR: 0.0176301002502s"
string(24) "IPR2: 0.000438928604126s"

Sure, the difference between RPP and IPR2 is just a tiny 0.0002s, but it's there, and I don't see keeping 3 additional database columns as a big complication. Especially if it helps the website handle higher traffic and lets me use as many parameters as I want or need :).

Either way, this is irrelevant, as in the next couple of releases, this will all be core code anyway!

@creativedutchmen: DO YOUR EXAMS. DON'T DO ANYTHING ELSE.

:-)

@creativedutchmen: DO YOUR EXAMS. DON'T DO ANYTHING ELSE.

...and pee. DON'T FORGET TO PEE.

designermonkey, that would be great, because it would mean a bit less work for me when setting up a new Symphony installation :).

Symphony • Open Source XSLT CMS
