
First, I'd like to say thanks for the software. It's a nice piece of work. The interface for handling application development is put together amazingly well. Despite the small bugs here and there, you can tell there has been a lot of attention to detail in putting everything together. I especially love the new debug screen. EXTREME kudos on that one.

Now to the... surprising... parts for me. To preface these questions and concerns: none of them are deal breakers for me. I'm mostly looking for the reasoning behind why things were done this way, because there may be a really good explanation. I'm more or less trying to pick your brains and maybe point out some downsides that may or may not be there. It all deals with the database schema and how it's interacted with.

One of the first things I noticed when I visited the debug page for the default "application" is that there are 91 queries for the front page view. For a second I thought the query page was just broken, because 91 non-redundant queries on a single page view seemed nearly impossible, and very high, to me. So I decided to check out the database schema (so I could come up with a way to convert my old posts to the new version of Symphony). I was a bit surprised to see individual tables for each section field. It makes some sense to me, but I'm not entirely sure what the benefits of this schema are over the one in 1.7. Can you guys elaborate on why you went this route instead of sticking with the older schema, which seemed to allow for a lot of optimisations with MySQL 5 features and required less database interaction?

I second nshepherd's question. That's mainly what I meant here:
http://symphony21.com/forum/discussions/13/1/

At a guess, with an average of 20 entries per page, the number of queries will be around 200. If I'm not wrong, that could be a serious restriction for medium/high-load projects.

With that many DB queries it would be great to have some aggressive caching mechanism - say, for the page's XML output, or for each data source's output; I'm not sure which.

Class.cacheable is in the core, so I hope caching will be possible either in the core or as an extension.

A small lyrical aside :)
If I could help you with your hard work, I would do so with pleasure. But, sorry, my programming skills are much lower than yours.

Alistair, what was it that convinced you to go this route?

1. The Problem

In Symphony 1.7 there was a single data table for entries called sym_entries2customfields. In S2 I initially transferred that idea over, renaming the table to tbl_entries_data and making the structure more generic. It consisted of key/value pairs of data, with value being a TEXT field to accommodate any kind of data. Each field wrote to the same table, setting whatever key and value it liked, with the entry_id attached as well.
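
Roughly speaking, the idea looked something like this (an illustrative sketch only; the column names and types here are my assumptions, not the exact schema):

    -- Illustrative sketch of a generic key/value entry-data table.
    -- Column names and types are assumptions, not the real S2 schema.
    CREATE TABLE `tbl_entries_data` (
      `id`       INT UNSIGNED NOT NULL AUTO_INCREMENT,
      `entry_id` INT UNSIGNED NOT NULL,
      `field_id` INT UNSIGNED NOT NULL,
      `key`      VARCHAR(50)  NOT NULL,
      `value`    TEXT,
      PRIMARY KEY (`id`),
      KEY `entry_id` (`entry_id`),
      KEY `field_id` (`field_id`)
    ) ENGINE=MyISAM;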

This works quite well since you can grab all the entry data in one SQL query. Life is good. That is, until you hit around 5,000 entries (maybe even less). For a simple 'blogging' section, the data table explodes at a ratio of around 1:11. That is, 1 entry will produce around 11 rows in the data table, so 5,000 entries end up as around 55,000 data rows. Jump to 15,000 entries, which is what we used for benchmarking, and you end up with something like 165,000 rows and a table size of 100 MB+. The reason for the growth was that, for example, the input field requires 3 things: handle, value and value_raw. The only way to store this using a key/value pair system is with 3 distinct rows, hence the 1:11 ratio.
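
To make that ratio concrete, here is an illustrative example (made-up values) of how a single text input on a single entry costs three rows under a key/value design:

    -- Illustrative only: one text input value for one entry needs three rows,
    -- because handle, value and value_raw each get their own row.
    INSERT INTO `tbl_entries_data` (`entry_id`, `field_id`, `key`, `value`) VALUES
      (1, 7, 'handle',    'my-first-post'),
      (1, 7, 'value',     'My First Post'),
      (1, 7, 'value_raw', 'My *First* Post');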

The big problem with this approach is that you now have a serious bottleneck for ALL your sections. A section with only a single entry will suffer horribly just because some other section has so many entries. The SQL for grabbing an entry's data, using various filters, needs several JOINs against the data table. Joining a 165,000-row table to a 15,000-row table is really intense, and we saw an exponential performance drop. It was taking around 8-9 seconds to return a typical request, and even longer when you needed to do a COUNT or use GROUP BY.
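
As a rough illustration of the kind of query involved (table and column names are assumed for the sketch, not the actual S2 SQL), filtering entries means self-joining that one big shared table again and again:

    -- Rough sketch: filtering one section still drags the shared table in,
    -- once per key being matched. Names are illustrative only.
    SELECT e.id, d2.`value`
    FROM `tbl_entries` AS e
    JOIN `tbl_entries_data` AS d1
      ON d1.entry_id = e.id AND d1.`key` = 'handle' AND d1.`value` = 'my-first-post'
    JOIN `tbl_entries_data` AS d2
      ON d2.entry_id = e.id AND d2.`key` = 'value'
    WHERE e.section_id = 3;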

2. A Potential Solution

So, we got to thinking. Obviously this was not acceptable, so what could we do to reduce the strain? Our biggest problem was that the data table was far too normalised. I think it would have been classed as 4NF, which is of no real advantage, other than slowing everything down (not normally associated with advantages, but whatever). So I thought about how I could de-normalise it without introducing too much redundancy.

I had read about a webmail site's DB implementation, where they talked about generating new tables for every user. At first this struck me as really weird, but the more I thought about it, the more it seemed to make sense. If I created a new table for each section's entry data, I could isolate the bottlenecks so they were section-specific: the system could scale horizontally. Long story short, I did that, and the performance increase was quite dramatic... for sections that didn't have many entries. We still had the problem of trying to do JOINs on a very large data set, albeit without the other sections' data. My next thought was, "what if I make it even more granular?" I proceeded to create individual tables for each field in a section. This gave me some very nice benefits.
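
For the intermediate step, the idea was roughly one data table per section, still holding key/value rows (a hypothetical sketch; names are illustrative, not the real schema):

    -- Hypothetical sketch of the per-section step: each section gets its own
    -- key/value data table, so one busy section no longer slows down the rest.
    CREATE TABLE `tbl_entries_data_section_3` (
      `id`       INT UNSIGNED NOT NULL AUTO_INCREMENT,
      `entry_id` INT UNSIGNED NOT NULL,
      `key`      VARCHAR(50)  NOT NULL,
      `value`    TEXT,
      PRIMARY KEY (`id`),
      KEY `entry_id` (`entry_id`)
    ) ENGINE=MyISAM;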

3. The Solution

Firstly, each table could be properly indexed, and table columns could be specific to a field in Symphony rather than using the key/value pair idea. Secondly, the table growth rate could be controlled better. By giving each table a more defined structure, I no longer needed 3 rows to store handle, value and value_raw, but rather a single row with 3 columns. I could index handle, and define all of them as VARCHAR rather than TEXT. The ratio suddenly dropped to around 1:1 for most fields (some fields still needed additional rows, like select boxes and tag lists). Another example is the textarea field, which could now have a FULLTEXT index specified. Thirdly, because the data is more contained, deleting fields and their data was a breeze: DROP TABLE blah.
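
As a rough sketch of what that looks like per field (column names are assumptions for illustration, not the exact S2 schema):

    -- Sketch of the per-field idea: a text input field stores one row per
    -- entry, with specific, indexable columns instead of generic key/value rows.
    CREATE TABLE `tbl_entries_data_14` (
      `id`        INT UNSIGNED NOT NULL AUTO_INCREMENT,
      `entry_id`  INT UNSIGNED NOT NULL,
      `handle`    VARCHAR(255),
      `value`     VARCHAR(255),
      `value_raw` VARCHAR(255),
      PRIMARY KEY (`id`),
      KEY `entry_id` (`entry_id`),
      KEY `handle` (`handle`)
    ) ENGINE=MyISAM;

    -- A textarea field can get a FULLTEXT index on its own table, and
    -- deleting the field is just: DROP TABLE `tbl_entries_data_15`;
    CREATE TABLE `tbl_entries_data_15` (
      `id`       INT UNSIGNED NOT NULL AUTO_INCREMENT,
      `entry_id` INT UNSIGNED NOT NULL,
      `value`    TEXT,
      PRIMARY KEY (`id`),
      KEY `entry_id` (`entry_id`),
      FULLTEXT KEY `value` (`value`)
    ) ENGINE=MyISAM;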

Lastly, the SQL was more optimised. It didn't need to JOIN a table full of irrelevant data, but rather a table consisting of ONLY relevant data. An 8-9 second query now took only about 0.2-0.3s, with very minimal modification.
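
The earlier filter sketch, redone against a per-field table (again, names are illustrative only), now touches just the data it actually needs:

    -- Same filter as before, but against the single relevant field table.
    SELECT e.id, t.`value`
    FROM `tbl_entries` AS e
    JOIN `tbl_entries_data_14` AS t
      ON t.entry_id = e.id
    WHERE e.section_id = 3
      AND t.handle = 'my-first-post';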

It also introduced something I had not thought of. Where before the developer had no control over DS optimisation, since JOINing a table with all the entry data can only be done one way, they now have complete control. Each filter and included field potentially has an effect on the performance of the SQL query. A developer can now be very specific about what they include, to try and squeeze as much as possible out of Symphony. They can see that sorting by a text input and grouping by category is going to be more intensive than not grouping and sorting by ID.
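
For example (illustrative queries only, not actual data source output), the cost difference between sorting on the entry ID and sorting on a joined text field's value is now visible and under the developer's control:

    -- Cheap: sorts on the indexed primary key, no field tables involved.
    SELECT e.id
    FROM `tbl_entries` AS e
    WHERE e.section_id = 3
    ORDER BY e.id DESC;

    -- More expensive: joins a field table and sorts on its VARCHAR column.
    SELECT e.id
    FROM `tbl_entries` AS e
    JOIN `tbl_entries_data_14` AS t ON t.entry_id = e.id
    WHERE e.section_id = 3
    ORDER BY t.`value` DESC;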


Coming Soon: The downsides to this approach

I'll put my own spin on things (from the perspective of the guy sitting next to the DB guy). From the point of view of a web developer, these are my thoughts:

Firstly, sure, the overhead baseline has increased when you have 20 to 50 entries compared to the old database structure, but the ratio of growth is on average 1:1, not 1:11. What that means is that if there is a theoretical upper limit to the data the system can store, we've just increased that limit eleven-fold.

Secondly, when I include a field in the data source editor, each field I add is another table that gets included. In 1.7, no matter which fields I included, it had no effect on the efficiency of the query.

Thirdly, because each field is a table, the design of each field can be tailored. Text fields can be indexed, some fields can have search features enabled and, more importantly, developers can create their own field types!

Just my 21 cents.

Thanks Allen, those are the main points I was trying to convey. They probably got a bit lost in there somewhere. ;)

I have found both comments very informative.

Thank you Alistair and Allen, things are clearer to me now :) I'll try moving the work I've already done over to S2 to see how the performance changes!

The number of queries is less important than the time needed to execute each query: is it better to have 1 query taking 10s, or 10 queries taking 0.01s each? ;-)

ekerazha, precisely!

Of course, we are always looking at optimisations; there are quite a few more things we've yet to implement which will reduce execution times further. But ultimately - high query count or not - what matters is the overall execution time.

which is of no real advantage, other than slowing everything down (not normally associated with advantages, but whatever)

:D

Thanks for the explanation, very useful to know why you made this somewhat unorthodox decision. I'm wondering about the downsides I can't see here.

We've got a podcast coming which will discuss this issue in more detail :)

Well, there were a few positives pointed out for this approach that I hadn't thought about. The biggest negative I see, however, is the fact that there are more queries. Very true, eze: 1 query taking 10s is worse than 10 queries taking 0.1s each. However, there is still a downside: all of those queries have to be sent to the MySQL server, and there are several problems here. First, some shared hosting accounts have a strict query limit (on the number of queries going on at a time); let's say it's set to 100, and let's say a single page pulls records from 5 sections of data. Let's assume, going from the front page (where I believe 2 sections are being pulled), that each section takes 50 queries to pull the basic data. Now we are at 250 queries. The problem arises when the data actually has to move, and on top of that, other sites on the same server are battling for MySQL resources. This is taking the performance problem to a possible extreme, but what I see is that as long as the data being pulled for a page is simple, everything is fine. If multiple sections of data are being presented, you reach a point where the actual bottleneck is no longer the performance of the server, but the speed at which the data and queries are transferred to the server.

There's another, rarer problem, but it could still arise. Let's say you use a remote server for your MySQL stuff. I'm going to exaggerate the size of a query (data-wise) so the effect can be seen at larger volume. Say each query "packet" is 10 kB (this is WAY over-exaggerated, but it shows the scaling for larger sites that may have 10-15 sections presented on a single page) and you have just 100 queries for the page. The page's queries alone would then amount to 1 MB and would take a significant amount of time just to reach the MySQL server, let alone get the response back.

Now, these two examples are exaggerated a bit and are just my thoughts on what would happen (hell, I may be over-exaggerating), so until some actual tests are done I may be worrying about these problems far more than I need to.

I don't want this to come across as me downplaying the amazing job you guys have done, because I am highly impressed with everything you have presented us in this product.

Thanks again and I'm looking forward to hearing what downsides you have come up with for this method Alistair. :)

For the sake of understanding where the entry data resides and how to easily eliminate obsolete data, I really like the solution the Symphony Team have implemented. I was wondering how I might go about achieving a clean install.

nshepherd, wouldn't caching reduce the number of requests? So a site that grows gradually wouldn't get overloaded; only when you import an already-filled DB into an install on a shared host?

MySQL query caching will not reduce the number of queries; however, it will reduce the load and time required by each one. Is that what you are referring to, newnomad?
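
For anyone curious, the MySQL 5.x query cache can be checked and sized roughly like this (values are just an example; depending on the MySQL version and configuration, these may need to go in my.cnf rather than being set at runtime):

    -- Enable and size the query cache, then watch its hit counters.
    SET GLOBAL query_cache_type = 1;
    SET GLOBAL query_cache_size = 64 * 1024 * 1024;
    SHOW STATUS LIKE 'Qcache%';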

I'm looking forward to hearing what downsides you have come up with

Actually, the main downside would be the number of queries; however, S2 is actually better than 1.7 on query count. We talk a bit about this in the upcoming podcast. Hopefully Allen will get it out soon.

I look forward to Symphonians' feedback on possible (non-)issues on shared hosting accounts (not VPS), and I also wonder how fields are handled DB-wise in other scalable frameworks such as Django.

How is the podcast coming along? I need some convincing on the many queries.

newnomad: you bumping this topic made me go back and see your question. I think a caching system that cached the data structures (as they relate to the database) would cut out a lot of the queries. You'd basically be cutting out the queries that probe the database to determine how to pull the data for each data source. I think a lot could be done to optimize this query count; however, having dissected the innards a bit more, I agree with Alistair that it is a necessary evil for allowing the end user to do so much customization so quickly and easily.

I think after a more polished version of S2 is released we'll see more and more people helping out with implementing these improvements. I know I've already looked a bit into how much work would need to be done to implement a caching system to minimize database hits.

Have you guys considered some sort of intermediary caching solution (something other than MySQL query caching)? One of my last clients had a very large, very high-traffic website that pulled in upwards of 50 million page views a month. The code I had to work with was quite old and very inefficient with some pretty crazy query counts. All the traffic wound up slowing their farm down to a crawl. The bottleneck, of course, was our database cluster.

We ended up implementing a load-balanced array of memcached-enabled servers and compiled PHP to make use of them. It was a pretty simple task to augment our current database abstraction layer to make use of our new caching system. Once we promoted the new system to production we noticed an immediate boost in server performance and haven't had a slowdown or server crash due to traffic since.

I'm not saying use memcached, but an in-house solution would work just as well. Especially for sites with a lot of transient data and semi-static content. The only problem I would see is preserving the relationships among sections / custom fields / values in the new table structure of S2.
