
Closed #657: Query optimisation (ORDER BY and JOIN order)

We've discussed this before but never really found a conclusion, until now.

Am using this setup for testing:

Comments section (section ID 2) with fields: Name, Comment, Date, Approved y/n. 38,000 entries total.

I created a simple DS, filtering on the Approved checkbox being yes and limiting to 20 entries. When profiling the DS the initial query takes 0.14s, which is entirely respectable given the number of rows.

SELECT SQL_CACHE `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date` FROM `sym_entries` AS `e` WHERE 1 AND `e`.`section_id` = '2' ORDER BY `e`.`id` DESC LIMIT 0, 20

Note that the DS has the default sort order on System ID descending.

I then changed the sort order to the Name field, to order comments by author name. This produces the following query which now takes 3.29s:

SELECT SQL_CACHE `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date` FROM `sym_entries` AS `e` LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`) WHERE 1 AND `e`.`section_id` = '2' ORDER BY `ed`.`value` ASC LIMIT 0, 20

Notice the difference between them: the extra table has to be joined for the sort. The join itself doesn't slow the query down until it is used in the ORDER BY statement.

After a lot of reading I figured this is because the query is being forced to perform a "filesort" which, put simply, means that table indexes aren't being used for the join/ordering and they have to be evaluated at runtime instead (through a temporary table I presume).
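
For reference, the easiest way to spot this is to prefix the slow query with EXPLAIN and look at the Extra column, which should show something like "Using temporary; Using filesort" (a sketch using the same section and field IDs as above):

EXPLAIN SELECT `e`.id, `e`.section_id
FROM `sym_entries` AS `e`
LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`)
WHERE 1 AND `e`.`section_id` = '2'
ORDER BY `ed`.`value` ASC
LIMIT 0, 20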

After some more reading I discovered that there are a lot of ways to prevent a filesort and improve performance. A lot.

The most obvious is to use FORCE INDEX and USE INDEX keywords nested in the query, to state explicitly to MySQL that certain indexes should be used. I tried this but concluded that it's too fiddly to accurately pepper Symphony's queries with these keywords without syntax errors cropping up.
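
For illustration, the sort of hint I was experimenting with looks roughly like this (a sketch only; I'm assuming the data table has an index named `value`, so check the schema before copying it):

SELECT SQL_CACHE `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date`
FROM `sym_entries` AS `e`
LEFT OUTER JOIN `sym_entries_data_6` AS `ed` FORCE INDEX (`value`) ON (`e`.`id` = `ed`.`entry_id`)
WHERE 1 AND `e`.`section_id` = '2'
ORDER BY `ed`.`value` ASC
LIMIT 0, 20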

Another idea was to use the STRAIGHT_JOIN keyword at the beginning of the query, which forces MySQL to join the tables in the order they are written rather than letting the optimiser pick its own order (yes, the order matters), so you can rule out a bad join order chosen by the optimiser. This made no difference.
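
For completeness, the variant I tried just adds the keyword after SELECT (a sketch of the same query):

SELECT STRAIGHT_JOIN `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date`
FROM `sym_entries` AS `e`
LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`)
WHERE 1 AND `e`.`section_id` = '2'
ORDER BY `ed`.`value` ASC
LIMIT 0, 20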

Then I discovered this really simple example which concluded with two suggestions to avoid a filesort:

  1. You can avoid the filesort by making the ORDER BY column appear in the WHERE clause
  2. When using a join, make sure the left-side table's column is used in the ORDER BY clause, or change the join type

Option #1 seems like the simplest solution so I gave it a go in the raw SQL (in Sequel Pro):

SELECT SQL_CACHE `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date` FROM `sym_entries` AS `e` LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`) WHERE 1 AND `ed`.`value` IS NOT NULL AND `e`.`section_id` = '2' ORDER BY `ed`.`value` DESC LIMIT 0, 20

Notice the additional:

AND `ed`.`value` IS NOT NULL

This is there solely to add the column used in the ORDER BY clause (ed.value) into the WHERE clause. To add this to Symphony directly it can be done in one of two ways: modify the EntryManager to add an extra clause to the $where variable when sorting on a section field, or modify the $where within each field's buildSortingSQL method. I chose the latter since the column name might be different for each field so cannot be done in EntryManager.

So in field.input.php I added this to buildSortingSQL:

$where .= " AND `ed`.`value` IS NOT NULL";

Now when running the DS again it executes in an insane 0.0007s!

So, more testing is definitely needed, but it looks like this is a simple way to combat the filesort problems we've experienced on large sites. My slight hangup is that specifying IS NOT NULL might not be the correct behaviour for some fields. Sorting on a field shouldn't really add any filtering criteria to the query (i.e. when sorting by Name, would you also want to include those entries that do not have a Name value set? Perhaps.). So really we need an alternative way of getting the ORDER BY column into the WHERE, but have the WHERE not affect the query. IS NOT NULL was the first thing I thought of but there must be an alternative.

Happy querying.

For what it's worth, the same performance gain can be seen in the backend. When viewing the Comments section and sorting by the Name column, the page takes 3.5s to load without the optimisation, and loads instantly after (38,000 entries, 17 per page).

Interesting. How about:

`ed`.`value` = `ed`.`value`

I guess it will be optimized and "removed" from the query in the process, but maybe it will be enough to fix the ORDER BY slowdown?

Or, since e.id is selected anyway, maybe add it to the end of ORDER BY? Will that help?

SELECT SQL_CACHE `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date` FROM `sym_entries` AS `e` LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`) WHERE 1 AND `e`.`section_id` = '2' ORDER BY `ed`.`value` DESC, `e`.`id` ASC LIMIT 0, 20

Nick and I discussed this at length last night on iChat and came up with a couple of things.

  • The query is slow because it ignores the INDEX on ed.value.
  • The LEFT JOIN makes the query go through every row in sym_entries and matches it to the data table, regardless of whether the data table has a value or not. This behaviour is necessary to ensure that entries are displayed even if the field has no value (i.e. you add a field to a section after 100 entries have already been added, and an author then sorts on that column)

We discovered that if we add ed.value IS NOT NULL, or make it a RIGHT JOIN, the INDEX is used, and the query is much much much quicker, but we lose the entries where no data exists for the field.
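
For reference, the RIGHT JOIN variant is the same query with only the join type changed:

SELECT SQL_CACHE `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date`
FROM `sym_entries` AS `e`
RIGHT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`)
WHERE 1 AND `e`.`section_id` = '2'
ORDER BY `ed`.`value` ASC
LIMIT 0, 20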

I made the issue more generic and posted it on Stack Overflow for more thoughts.

Good idea with `ed`.`value` = `ed`.`value` Marcin, but unfortunately we found that even adding this extra WHERE clause has the same effect as changing to a RIGHT JOIN. The column added to the WHERE must match the column used to sort, so you cannot add id to the WHERE but sort by value; that has no effect.

As Brendan has clarified, the query optimisation occurs when the indexes are used. The indexes are used under two conditions:

  • changing the joins to RIGHT JOIN, or
  • adding this extra WHERE clause

Two approaches to the same effect. The query is much faster, but they both have the side effect that if you're sorting by a field (column) and entries do not have values for that field, they will not appear in the query result. So in the backend you find entries start disappearing from the table without explanation.

As the Stack Overflow thread suggests:

there won't be an index that readily supports a "Foreign Key not present" sort of query

If we want the behaviour whereby entries are still returned when the sorting column has no value for that entry, then we have to use a LEFT JOIN, and the result of a LEFT JOIN is poorer performance.

How about LEFT JOIN with WHERE:

(`ed`.`value` IS NULL OR `ed`.`value` = `ed`.`value`)

That should include both entries with and without a value set, but I am not sure if it will speed up anything at all (I have no database with a large enough data set to test it, sorry).

Or even better (just in case the database tries to actually compare strings for every row):

(`ed`.`value` IS NULL OR `ed`.`value` IS NOT NULL)

:).

We did try both of these variations when working this through, but neither will work. For the desired sorting behaviour (retaining rows that do not have a value for the ORDER BY column), the theory is that the additional WHERE clause must always evaluate to TRUE, hence like you, we tried conditions that included both positive and negative results. However when this occurs, a filesort is still performed.

I urge you to try it for yourself. The following query is one from the EntryManager:

SELECT `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date`
FROM `sym_entries` AS `e`
LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`)
WHERE 1
AND `e`.`section_id` = '2'
ORDER BY `ed`.`value` ASC
LIMIT 0, 20

Make sure sym_entries_data_6 contains many rows, say more than 50,000, to make the difference obvious. The above query will be slow (several seconds), but by adding in additional WHERE clauses you can see whether performance improves. You must then try the WHERE in a situation where entries do not have a value in this table and make sure that they are still returned, while the query remains fast.

I think we've got to abandon this line of thought, since the behaviour we want doesn't lend itself to good performance. We trade a little in performance for the behaviour we need.

(`ed`.`value` IS NULL OR `ed`.`value` IS NOT NULL)

Additionally, the problem with that code is that the MySQL optimiser is smart enough to realise that the two conditions above essentially cancel each other out, so it doesn't use the INDEX at all.

@nickdunn

It's looking that way. I would like to try the suggestion on SO of using a RIGHT JOIN and then UNION with the LEFT JOIN & a WHERE clause. Both queries (or subqueries if you like) would use an INDEX and the result set would be what we want. We'd just have to test and see if it's any faster!

the MySQL optimiser is smart enough to realise that the two conditions above essentially cancel each other out, so it doesn't use the INDEX at all.

FORCE INDEX?

I would like to try the suggestion on SO of using a RIGHT JOIN and then UNION with the LEFT JOIN & a WHERE clause

Makes sense, but I'm not sure you could retrofit the EntryManager class to build queries in this way, given the way fields pass only WHERE and JOIN strings. But worth investigating for the next major version!

Query Optimisation [#657]

This was tested on MySQL 5.0.51a and MySQL 5.5.10, using 43k entries.

Original

A LEFT JOIN resorts to a filesort and temporary tables and generally scales quite badly as it doesn't use any INDEXes. This approach works, though, because it will still display entries that have no value in the ORDER BY field.

UNION a RIGHT JOIN and a LEFT JOIN

The idea behind this was suggested on Stack Overflow. The concept was that the RIGHT JOIN would use the INDEX on value, and then a WHERE clause on the LEFT JOIN would also trigger the value INDEX. It had little effect on the result: sometimes worse, sometimes better.
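
Roughly, the shape of query we were testing looked like this (a sketch only, reconstructed from the Stack Overflow suggestion; the exact queries we ran varied):

(SELECT `e`.id, `ed`.`value` AS `sort_value`
 FROM `sym_entries` AS `e`
 RIGHT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`)
 WHERE `e`.`section_id` = '2')
UNION
(SELECT `e`.id, NULL AS `sort_value`
 FROM `sym_entries` AS `e`
 LEFT OUTER JOIN `sym_entries_data_6` AS `ed` ON (`e`.`id` = `ed`.`entry_id`)
 WHERE `e`.`section_id` = '2' AND `ed`.`value` IS NULL)
ORDER BY `sort_value` ASC
LIMIT 0, 20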

Insert a whole bunch of other crazy ideas

Rowan and I hacked away in our Sequel Pro consoles for at least half an hour experimenting with all sorts of weird queries, including subqueries, EXISTS, NOT IN, MySQL variables and even toying with the idea of just doing 2 separate queries.

The winner

The end result is not a final solution, rather a tweak on the current behaviour that squeezes some more speed out of a bad situation. At the moment fields provide the SQL through buildSortingSQL, which (for the core fields at least) builds the JOIN and the ORDER BY. Our solution drops the JOIN and instead uses a subquery in the ORDER BY.

Old implementation
public function buildSortingSQL(&$joins, &$where, &$sort, $order='ASC'){
    $joins .= "LEFT OUTER JOIN `sym_entries_data_".$this->get('id')."` AS `ed` ON (`e`.`id` = `ed`.`entry_id`) ";
    $sort = 'ORDER BY ' . (in_array(strtolower($order), array('random', 'rand')) ? 'RAND()' : "`ed`.`value` $order");
}
New implementation
public function buildSortingSQL(&$joins, &$where, &$sort, $order='ASC') {
    if(in_array(strtolower($order), array('random', 'rand'))) {
        $sort = 'ORDER BY RAND()';
    }
    else {
        $sort = sprintf(
            'ORDER BY (
                SELECT %s 
                FROM sym_entries_data_%d AS `ed`
                WHERE entry_id = e.id
            ) %s',
            '`ed`.`value`',
            $this->get('id'),
            $order
        );
    }
}
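
For a field with ID 6, that produces a query along these lines (a sketch; the SELECT list and filters come from the EntryManager as before):

SELECT `e`.id, `e`.section_id, e.`author_id`, UNIX_TIMESTAMP(e.`creation_date`) AS `creation_date`
FROM `sym_entries` AS `e`
WHERE 1 AND `e`.`section_id` = '2'
ORDER BY (
    SELECT `ed`.`value`
    FROM sym_entries_data_6 AS `ed`
    WHERE entry_id = e.id
) ASC
LIMIT 0, 20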

Testing across both configurations on 43k entries resulted in a 25% improvement when using this new style of query.

The best thing? Total backwards compatibility. All core fields can be updated to use it and all extensions that don't override it will get the benefit.

How about 2.3

While this idea works, we can do better! Rowan and I came up with an idea that would ensure default values exist for all entries regardless of when the field was added in a Section's lifetime. Implementation-wise, we thought that for 2.3, when fields are created in the section editor, they are passed an array of Entry IDs in the current section. From this, the Field can INSERT default values into the table. This can also solve some filtering workarounds we currently have with the Checkbox field, in that the filter searches for values where value = $blah or where value is null. For Fields that choose to implement this, their buildSortingSQL function can output a result similar to the Old implementation, instead opting to use a RIGHT OUTER JOIN.
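
As a rough sketch of what filling in those defaults could look like for a newly created field (hypothetical field ID 42 added to section 2; the columns and defaults would vary per field type, and it could equally be done per entry ID from PHP):

INSERT INTO `sym_entries_data_42` (`entry_id`, `value`)
SELECT `e`.`id`, NULL
FROM `sym_entries` AS `e`
WHERE `e`.`section_id` = '2'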

The scope for the callback doesn't have to stop with filling default rows though; it could be used to resave existing entries and apply new settings, e.g. the Reflection field.

So, a little update from some implementation testing.

Short: We're back to square one.

Long: The sorting can only be applied to certain fields. Fields that store multiple rows per entry in the database cause the subquery to return more than one row, which breaks the SQL. Adding ORDER BY and LIMIT statements to the subquery causes MySQL to freak out and just take a crazy amount of time to execute the query, which sucks.
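
To illustrate, the kind of thing we tried for multi-row fields collapses the subquery to a single row, and it's this extra ORDER BY/LIMIT inside the subquery that MySQL handles so badly (a sketch of the ORDER BY clause only):

ORDER BY (
    SELECT `ed`.`value`
    FROM sym_entries_data_6 AS `ed`
    WHERE entry_id = e.id
    ORDER BY `ed`.`value` ASC
    LIMIT 1
) ASC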

So what I have done is just applied the new sort to fields that are safe. Of the core these are the Checkbox, Date, Input and Upload fields.

doh

Just a thought (like I wrote: I do not have access to a big enough data set to test these things): how about using a FULL OUTER JOIN and then:

WHERE `e`.`id` IS NOT NULL

Will a FULL join make MySQL use indexes instead of a filesort?

Update: scratch that. I am an idiot; it looks like there is no FULL JOIN in MySQL.

I like the idea of making sure that a field has data for every entry, but I am not sure if the first "import" should be done through PHP. I mean, the list of entries can be huge, so either Symphony will have to implement some way of slicing tasks into smaller parts (which would be good for other things too; Drupal has an API for that and it helps in a lot of cases, e.g. updating a module that has a lot of data to "upgrade" or "convert"), or use INSERT ... SELECT with default values (and then allow the module to do whatever it needs to do).

I just tried a few things (I was testing using EXPLAIN: no need for a large data set to see if filesort is used or not :), but nothing helped. It looks like fields should insert default data for each entry if you want to optimize selection without huge and/or tricky workarounds.

It could be. The idea was the fields would just add default values, so a field might go through and set the id and entry_id values but leave the rest as NULL.

The task queue is an idea Rowan and I were tossing around today as well! We decided Symphony 3 is best for an idea like that.

edit Yep, that's the conclusion we came to as well.

This issue is closed.
