April 15, 2020
8 minute read
We had a problem: large queries that collected lots of data from AWS DynamoDB (DDB) were taking a long time. We had a data table of about 68,000 entries, with about 35,000 items sharing a single primary hash key. The query to return all the data on those items took about 10 seconds, and people on the internet don't like to wait that long. So I set out to find out what was making these queries slow and what could be done to improve the query times...
Long story short (I'll try to keep the entire post short and to the point): we came to suspect that Amazon's boto3 library simply isn't doing its job very well (or at least not very quickly).
Throughout this post I'll refer frequently to a benchmarking script I wrote to measure how effective each change was at improving DDB query performance. The script simply ran the query mentioned above and used Python's timeit module to measure how long the operation took.
Since we were already using boto3, we initially decided it would be most appropriate to try to improve its performance.
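The original script isn't shown in the post, but the core of it is likely something like the sketch below: a helper that times a query callable with timeit. The `run_query` argument is a placeholder for whatever actually issues the DDB query.

```python
import timeit


def benchmark(run_query, repeats=5):
    """Run `run_query` `repeats` times and return per-run timings in seconds.

    `run_query` is any zero-argument callable (hypothetical here) that
    performs the DynamoDB query being measured.
    """
    return [timeit.timeit(run_query, number=1) for _ in range(repeats)]
```

Averaging several runs smooths out the run-to-run variance that throttling (discussed below) can introduce.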
The first benchmark turned out quite poorly, taking a whopping 15 seconds. However, it later turned out that this was due to the table's restricted read capacity (15 read capacity units). Because of the way Query works, if the result set is large (>1MB), multiple requests to the API are needed. I found that some of those requests were throttled when the read capacity was exceeded (normal requests took about 0.7 seconds; throttled requests took up to 3 seconds). After provisioning enough read capacity (100 units), no requests were throttled, and the query completed in about 10 seconds fairly consistently.
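The multi-request behavior mentioned above comes from DynamoDB's pagination: each response over the 1MB limit includes a `LastEvaluatedKey`, which must be fed back as `ExclusiveStartKey` to fetch the next page. A minimal sketch of that loop, written so the page-fetching callable is injected (which also makes it testable without AWS access):

```python
def query_all(query_page, **kwargs):
    """Collect every page of a paginated DynamoDB query.

    `query_page` is any callable with the shape of boto3's Table.query
    (keyword arguments in, a dict with "Items" and an optional
    "LastEvaluatedKey" out). Injecting it keeps this loop AWS-agnostic.
    """
    items = []
    last_key = None
    while True:
        if last_key is not None:
            kwargs["ExclusiveStartKey"] = last_key
        page = query_page(**kwargs)
        items.extend(page["Items"])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
```

With a real table this would be called as `query_all(table.query, KeyConditionExpression=...)`.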
This tip may not apply to many situations (if any; why would anyone store data they don't need?), but there are numbers here that may interest some readers, and it may be worth comparing these results with those of the next tip.
The next interesting thing to examine was the table design itself. To better understand the SaaS black box that is DDB, I created a table with the same number of items as the original test data and the same primary hash key shared by approximately 35,000 items. This table contained only the required data, with the same schema as the original table (one primary hash/sort key pair and one local secondary index). Since the table held significantly less data (~2MB versus ~24MB in the original table), it isn't really surprising that the query took less time. It may be interesting to note, though, that although fewer requests were needed (2 instead of 12), each request took longer on average. I don't know enough about DDB's internals to say why that is.
10 seconds is still a long time, and nobody is happy to wait that long. The next thing that can improve query time with boto3 is adding a projection to the query (asking only for specific attributes of the data rather than whole items). By running the query with a projection of just the two primary key attributes (hash and sort) and the local secondary index attribute, the time needed dropped from ~10 to ~3 seconds. This cut the number of attributes per item from as many as 20 (although not all items have all attributes) down to 3. The interesting thing is that despite retrieving less data, the same number of requests was made. This suggests that the 1MB response limit is calculated from the total size of the items read, not the data returned, since inspecting the response shows only the projected attributes (which would also explain the lower time per request seen below).
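A projection is expressed with the `ProjectionExpression` parameter (plus `ExpressionAttributeNames`, since attribute names can collide with DynamoDB reserved words). The post doesn't show its schema, so the attribute names below (`pk`, `sk`, `lsi`) are placeholders; this sketch builds the parameter dict for the low-level Query API:

```python
def projected_query_kwargs(table, hash_value):
    """Build low-level DynamoDB Query parameters that fetch only the two
    primary-key attributes plus the local-secondary-index attribute.

    "pk", "sk", and "lsi" are stand-ins for the real (unpublished) schema.
    """
    return {
        "TableName": table,
        "KeyConditionExpression": "#p = :h",
        "ProjectionExpression": "#p, #s, #l",
        "ExpressionAttributeNames": {"#p": "pk", "#s": "sk", "#l": "lsi"},
        "ExpressionAttributeValues": {":h": {"S": hash_value}},
    }
```

The same dict would be passed to `client.query(**projected_query_kwargs(...))` on a boto3 low-level client, or serialized as the body of a raw Query request.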
It's important to note that, as with the previous tip, this may not always be applicable. But it goes to show that if you don't need all of the data in the table at any given time, projections can be quite worthwhile.
Doing this by itself brings no speedup (at least in theory), but it allows the data to be collected non-sequentially (non-sequential meaning in an asynchronous or parallel way). Since the API response is always paginated, with each response returning the sort key from which to resume the query, the query cannot be dynamically parallelized (desynchronized? desequentialized?). By knowing the page-boundary keys beforehand, however, the data can be collected non-sequentially. Because this requires fundamental changes to how the data is stored in the table, it may not be the most attractive option. In fact, I didn't even do this for real when collecting these results: I just scraped a list of "ExclusiveStartKey"s (sort key values) from the regular query and used them to simulate the multiple-hash-key scenario (since actual data/tables change over their lifetime, this wouldn't be possible outside the test environment).
With these request boundaries in hand, I wrote a method to collect the results asynchronously using Python's asyncio and aiodynamo modules. It queues the requests with aiodynamo's client query method, as mentioned earlier, and gathers the results asynchronously. This brought a speedup of about 4 seconds, from ~10 seconds to ~6 seconds. Further speedups may be possible through parallelization, but configuring a database to accommodate this might be too difficult to be worth it.
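The gathering step can be sketched with plain asyncio, independent of the DynamoDB client used. Here `fetch_page` is a placeholder coroutine (in the post's setup it would wrap an aiodynamo query starting from a given `ExclusiveStartKey`); the point is simply that with the start keys known in advance, all segments can be fetched concurrently with `asyncio.gather`:

```python
import asyncio


async def gather_segments(fetch_page, start_keys):
    """Fetch each pre-scraped query segment concurrently.

    `fetch_page` is any coroutine taking an ExclusiveStartKey (None for the
    first segment) and returning that segment's list of items. It is a
    hypothetical stand-in for a real aiodynamo-backed fetcher.
    """
    pages = await asyncio.gather(*(fetch_page(key) for key in start_keys))
    # Flatten the per-segment lists back into one result list.
    return [item for page in pages for item in page]
```

Order within the final list follows the order of `start_keys`, so results stay in key order even though the fetches overlap in time.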
After some research, we decided that maybe all this effort wasn't worth it... Maybe boto3 is just hopelessly slow?
That was a good thought.
After profiling the query operation with boto3, we found that boto3 spends a ridiculous amount of time just checking, re-checking, and triple-checking the validity/structure of the received data, and parsing the response data.
When examining these traces, the field we care most about is tottime, since it measures the actual time spent inside each function during execution (i.e., all the tottimes added together give the total execution time).
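The post doesn't show how the traces were produced, but the standard way to get a tottime-sorted view like this is the stdlib cProfile/pstats pair:

```python
import cProfile
import io
import pstats


def profile(fn, top=10):
    """Profile `fn` and return its stats sorted by tottime, the column
    the analysis above focuses on."""
    pr = cProfile.Profile()
    pr.enable()
    fn()
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(top)
    return buf.getvalue()
```

Calling `profile(run_query)` on the benchmark's query function is what surfaces botocore internals like `_parse_shape` near the top of the listing.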
The _handle_structure and _parse_shape methods are botocore internals that parse AWS response data into boto3-compatible data structures. This is largely unnecessary, though: it's possible to know the correct structure of the data ahead of time and immediately parse (or at least attempt to parse) the data into that structure without the intermediate steps boto3 imposes. To that end, we decided to implement our own request method for AWS. Luckily, Amazon actually offers examples of how to do this in Python (it's like they're trying to tell us something...), so I just copied and pasted most of the code and modified it to run a query instead of creating a table (and to use AWS sessions).
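The heart of Amazon's example is Signature Version 4 request signing. Below is a condensed sketch of the two pure pieces: the SigV4 signing-key derivation (as in Amazon's Python examples) and the JSON body for a raw Query call (sent with the `X-Amz-Target: DynamoDB_20120810.Query` header). The surrounding canonical-request and HTTP-sending code from Amazon's example is omitted here.

```python
import hashlib
import hmac
import json


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signature_key(secret_key: str, date_stamp: str, region: str,
                  service: str = "dynamodb") -> bytes:
    """Derive the SigV4 signing key (AWS4-HMAC-SHA256 key chain)."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def query_payload(table: str, key_expr: str, values: dict) -> str:
    """JSON body for a raw DynamoDB Query request (low-level wire format)."""
    return json.dumps({
        "TableName": table,
        "KeyConditionExpression": key_expr,
        "ExpressionAttributeValues": values,
    })
```

The signed request then goes out as a plain HTTPS POST to the regional DynamoDB endpoint, with no client-side shape validation at all, which is where the time savings come from.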
The results were incredible... Without "tricks" like projections and desequencing (I'm sticking with that word), the time for a query of ~35,000 items went from ~10 seconds with boto3 to ~4 seconds!
This could be a very good argument against using boto3, but it really depends on the use case. If you want usability without having to develop your own module, boto3 is easier to pick up. If the data needs to reach users quickly, though, it may be worth investing a little (really, a little) time in developing a requests-based method of accessing AWS. Moreover, since each of the ways of "speeding up" boto3 had some drawback (missing data, implementation difficulty), they may not even be worth pursuing further. As an aside, keep in mind that this data was collected only for the Query operation (and a heavily controlled example of it, although I would assume the results carry over to the general case).
Honestly, not much. Beyond the Amazon example mentioned earlier, you only need to change a few things: add variables to make the code more generic (in the example almost everything is hard-coded), put things into reusable functions, and replace the request type (i.e., Amazon's example payload) with the request you want to make, and you're ready to go. When querying (which is what this benchmark is based on, by the way), you have to deal with pagination (but you have to do that with boto3 anyway...). Finally, if you need to turn the response data into the models your codebase uses, you'll probably need a way to deserialize the response into those model(s). This is relatively little work, though, and (hopefully) a custom deserializer is faster than boto3's generic one (ours was).
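Because raw responses arrive in DynamoDB's wire format (every value wrapped in a type tag like `{"S": ...}` or `{"N": ...}`), the custom deserializer mentioned above is mostly a small recursive unwrapping function. A sketch covering only a few common type tags (a real one would also handle B, BOOL, NULL, SS, etc.):

```python
def deserialize(attr_value):
    """Convert one DynamoDB wire-format attribute value to a Python value.

    Handles only the S, N, L, and M type tags; a full deserializer would
    cover the remaining DynamoDB types as well.
    """
    (tag, value), = attr_value.items()
    if tag == "S":
        return value
    if tag == "N":
        # DynamoDB sends all numbers as strings.
        return float(value) if "." in value else int(value)
    if tag == "L":
        return [deserialize(v) for v in value]
    if tag == "M":
        return {k: deserialize(v) for k, v in value.items()}
    raise ValueError(f"unhandled DynamoDB type tag: {tag}")
```

Mapping the resulting plain dicts onto your own model classes is then ordinary constructor code, with none of botocore's generic shape-walking overhead.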
The script used in this benchmark query can be found here.
David is a student at the University of Sydney studying Software Engineering (graduating late 2020). From January 20th to February 28th, 2020, David was one of two HENNGE Global Interns. Apply now and take the challenge here.