
Calculate the number of rows efficiently


Dealing with large data sets makes it necessary to pick out only the newest or the hottest elements and not to display everything. In order to keep older items available, pagination navigations have become established. However, implementing pagination with MySQL is one of those problems that is often optimized poorly, with MySQL and certainly with other RDBM systems as well. Knowing the underlying database can help in optimizing pagination queries, because there is no real copy-and-paste solution.

Many allegedly optimized ways to do fast pagination are rattling around the web, but let's start with the worst query, which is nevertheless used very often:
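As a sketch of this naive offset query (the table "articles" and its auto_increment primary key "id" are assumptions used throughout the examples below):

-- Naive offset pagination: first page, offset 0, 15 rows per page.
SELECT * FROM articles ORDER BY id DESC LIMIT 0, 15;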

This is done in 0.00 sec. So, where is the problem? Actually, there is no problem with this query and its parameters, because the primary key of the following table is used and only 15 elements get read:
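A minimal table sketch for this scenario could look as follows; engine and columns are assumptions, the important part is the auto_increment primary key that the ORDER BY ... LIMIT can use:

CREATE TABLE articles (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(128) NOT NULL,
  content TEXT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;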

The real problem is clicks on pages with a large offset, like this:
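Sketched with the same assumed table, such a large-offset request looks like:

-- MySQL has to step over 100000 rows just to return 15 of them.
SELECT * FROM articles ORDER BY id DESC LIMIT 100000, 15;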

This takes about 0.22 sec on my data set with about 2M rows. An EXPLAIN shows that 100015 rows were read but only 15 were really needed; the rest was thrown away. Large offsets increase the amount of data used; MySQL has to bring data into memory that is never used! We could assume that most users only click around on the lower pages, but even a small number of requests with large offsets may endanger the entire system. Facebook has recognized this and doesn't optimize the database for many requests per second, but to keep the variance small. With this in mind, we shouldn't accept this loss and should use a different approach. Anyway, pagination queries also need another piece of information for the page calculation: the total number of elements. Well, you could get that number with a separate query like this very easily:
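For example, with the assumed table:

-- Full count on InnoDB: an entire index has to be scanned to count the rows.
SELECT COUNT(*) FROM articles;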

However, this takes 9.28 sec on my InnoDB table. An (inappropriate) optimization for this job is SQL_CALC_FOUND_ROWS, which combines the counting and the fetching into one query. But keeping the queries simple and short doesn't result in a performance gain in most cases. So let's see how this query performs; unfortunately, it is used in some major frameworks as the standard pagination routine:
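A sketch of that pattern, with the assumed table as before:

-- Fetch one page and have MySQL remember the total row count in one go.
SELECT SQL_CALC_FOUND_ROWS * FROM articles ORDER BY id DESC LIMIT 100000, 15;
-- Retrieve the total calculated by the previous statement.
SELECT FOUND_ROWS();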

Ouch, we've doubled the time to 20.02 sec. Using SQL_CALC_FOUND_ROWS for pagination is the worst idea, because no LIMIT optimization is possible: ALL rows must be read for counting, and only 15 rows get returned. There are also tips floating around to ignore indexes in order to perform faster. This isn't true, at least not when you need to sort the table. The following query takes about 3 minutes:
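A hedged sketch of what such an index-ignoring variant could look like; forcing MySQL away from the primary key makes it sort the whole table with a filesort:

SELECT SQL_CALC_FOUND_ROWS * FROM articles IGNORE INDEX (PRIMARY)
ORDER BY id DESC LIMIT 100000, 15;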

If you need further information about when to use SQL_CALC_FOUND_ROWS and when not, take a look at the dedicated articles on that topic.

Okay, let's start with the real optimization. There is a lot we can do to optimize pagination queries. I'll split the article into two sections: the first covers how to get the number of resulting rows, the second how to get the actual rows.

Calculate the number of rows efficiently

If you want to paginate a cache table where you still use MyISAM, you can run a simple COUNT(*) query to get the number of rows. HEAP tables also store the absolute number of rows in their meta information. It is more complicated, however, for transactional storage engines like InnoDB, where different numbers of rows can exist at any given time.

If you insist that the pagination is always based on the correct number of rows, cache the value somewhere and update it periodically via a background process, or whenever the cache must be invalidated by a user action, using an explicit NOT NULL index via USE INDEX, like:
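One hedged possibility, assuming a small NOT NULL secondary index named "created" exists on the table:

-- Count over an explicitly chosen, smaller NOT NULL secondary index;
-- the index name "created" is an assumption about the schema.
SELECT COUNT(*) FROM articles USE INDEX (created);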

If writes are no problem for you, you could also add an aggregate table which is maintained with INSERT and DELETE triggers, or use multi-queries in order to save some latency.
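A sketch of such an aggregate-table approach; the table and trigger names are assumptions:

-- Counter table seeded once with the current row count.
CREATE TABLE articles_count (total INT UNSIGNED NOT NULL) ENGINE=InnoDB;
INSERT INTO articles_count SELECT COUNT(*) FROM articles;

-- Keep the counter in sync on every insert and delete.
CREATE TRIGGER articles_count_ins AFTER INSERT ON articles
  FOR EACH ROW UPDATE articles_count SET total = total + 1;
CREATE TRIGGER articles_count_del AFTER DELETE ON articles
  FOR EACH ROW UPDATE articles_count SET total = total - 1;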

But do we really need an exact number of elements, especially for really big data sets? Does it matter to you whether there are 38211 or 39211 elements on a random site you're visiting somewhere on the net? Considering this, we could just as well approximate the number and output something like "40 to 80 of over 1000" in the user interface. This naturally requires preventing jumps to the last page and rethinking the paginator layout.

If your data set is really huge, I would recommend using a kind of infinite pagination, as I implemented it in my own paginator (a negative number of elements sets it to infinity).

If you build a pagination for search results, it's generally the case that the important stuff should be on the first page. If this isn't the case, optimize your search quality rather than optimizing the pagination to let users browse the whole result set. In the following, I'd like to focus more on gathering a good estimation for a result, which is also relevant for search results.

When you need an estimation of the number of rows of a whole table rather than of a subset, a good starting point is a SHOW query, which executes quite quickly:
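For instance (table name assumed as before):

-- The Rows column is an estimate for InnoDB and exact for MyISAM.
SHOW TABLE STATUS LIKE 'articles';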

Another idea would be to use the cardinality of a column with unique elements, like an auto_increment column:
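A hedged sketch, again with the assumed table:

-- The Cardinality reported for the PRIMARY/id index approximates the row count.
SHOW INDEX FROM articles;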

However, you don't need the whole table in most cases. And if you do, it's certainly a cache table using MyISAM, where you can run a fast COUNT(*). For a good estimation of only a part of the query, try to take the output of EXPLAIN into account, e.g.:
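For example, a hypothetical range condition on the assumed table; the "rows" column of the EXPLAIN output is the optimizer's estimation of how many rows match, without executing the query:

EXPLAIN SELECT * FROM articles WHERE id > 1000000;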

In this example the correctness of the estimation is about 99.91%, but there are cases where the estimation can deviate by 15% or more. Mark Callaghan suggested implementing such an estimation for InnoDB as a new ESTIMATED_COUNT() function. If the parser must be modified anyway, I would prefer to see COUNT(ESTIMATE *) over his approach, as we already have the DISTINCT modifier and another flag would look quite natural.

Another estimation approach for the number of rows of a table is using the information_schema. I have abused this meta information schema very heavily for optimizations lately, as you'll see in further articles. So, if a table never gets deletes, we could use the auto_increment value as the number of rows and be done:
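A sketch of that lookup (schema and table name are assumptions):

-- AUTO_INCREMENT is the next id to be assigned, so for a delete-free table
-- the row count is roughly AUTO_INCREMENT - 1.
SELECT AUTO_INCREMENT - 1 AS row_estimate
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'articles';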

If you have gaps in your table, and especially in your auto_increment range, try to figure out what percentage of gaps you have:
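One way to measure it, as a sketch:

-- Fraction of the id range that is missing; run this occasionally and cache it.
SELECT 1 - COUNT(*) / MAX(id) AS gap_ratio FROM articles;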

Cache this value somewhere and use it for estimation. The following query illustrates the usage with a more complete example:
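A hedged sketch of what such a combined query could look like; 0.85 stands for the cached density (1 - gap_ratio) and 1000 for the window beyond which we just say "and more". All of these numbers and names are assumptions:

SELECT
  ROUND(AUTO_INCREMENT * 0.85, -2) AS row_estimate,  -- rounded to hundreds
  (AUTO_INCREMENT * 0.85) > 1000   AS more           -- 1/0: further elements?
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'articles';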

This query returns a good estimation based on the auto_increment value and the percentage of gaps in the range. Additionally, you get the number rounded in a more user-friendly way, and you get a column called "more", which indicates whether there are further elements, so you can show an "and more" label or something like that.

Get the elements

Okay, we get to the more important part of this article: the retrieval of the page elements. As indicated above, large offsets slow down the entire system, so we have to rewrite the queries to make use of an index. As an illustration, I create a new table "news", which we sort by topicality and on which we implement an efficient pagination. For simplicity, we assume that the newest elements also have the highest ID:
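A sketch of such a table; everything besides the auto_increment primary key is an assumption:

CREATE TABLE news (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(128) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;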

A very fast approach is a query based on the last ID the user has seen. The query for the next page looks like this, where you pass the id of the last element on the current page:
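A sketch of that seek-style query; 123456 is a placeholder for the id of the last element the user has seen:

-- Next (older) page: seek below the last seen id, no OFFSET needed.
SELECT * FROM news
WHERE id < 123456
ORDER BY id DESC
LIMIT 15;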

The query for the previous page looks similar; here you pass the id of the first element on the current page and sort in reverse order (of course, you have to re-sort the resulting rows afterwards):
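Sketched the same way; 123470 is a placeholder for the id of the first element on the current page:

-- Previous (newer) page: the rows come back ascending and must be re-sorted
-- descending in the application (or via an outer query) before display.
SELECT * FROM news
WHERE id > 123470
ORDER BY id ASC
LIMIT 15;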


