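1. A text file holds 1 billion records, stored sorted by key. Design an algorithm that quickly finds the record for a given key.
$10^9 \approx 2^{30}$ records; at 1 kB per record the file is about 1 TB in total, far too large to load at once. Split the file into 1000 chunks of about 1 GB each. Because the records are sorted, comparing the target key with the first key of each chunk identifies the single chunk that can contain it; load only that chunk into memory and binary-search it.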
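
A variant of the same idea skips the chunking step and binary-searches the byte offsets of the file directly, so only about 40 short reads are needed for a 1 TB file. The following is a minimal sketch, not a tuned implementation. It assumes (none of this is stated in the question) that each record is one newline-terminated line of the form `key<TAB>value`, that lines are sorted by the raw bytes of the key, and that keys are unique. The names `find_record` and `_line_at_or_after` are hypothetical.

```python
import os


def _line_at_or_after(f, pos):
    """Return the first complete line that starts at byte offset >= pos."""
    if pos == 0:
        f.seek(0)
    else:
        # Step back one byte and discard up to the next newline, so that
        # the following readline() starts exactly at a line boundary.
        f.seek(pos - 1)
        f.readline()
    return f.readline()


def _key_of(line):
    """Extract the key from an assumed `key<TAB>value` line."""
    return line.rstrip(b"\n").split(b"\t", 1)[0]


def find_record(path, key):
    """Binary-search a sorted text file; return the matching line or None."""
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose next full line has key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or _key_of(line) >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
        if line and _key_of(line) == target:
            return line.rstrip(b"\n").decode("utf-8")
        return None
```

Calling `find_record("records.txt", "some_key")` (file name hypothetical) returns the whole matching line, or `None` when the key is absent. Each probe costs one seek plus one or two line reads, so the cost is dominated by disk seeks rather than by the size of the file.
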
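2. Design a distributed crawler system.
Configuration parameters: start_url, crawl depth, update frequency.
Features: scheduled re-crawling, deduplication, search; whether custom rules are supported.
Problems to address: distributed storage; how to deduplicate; disk I/O and network I/O; re-crawling; updating the index once data goes stale.
Start by estimating the volume. For example, if each page has 100 links, crawling 4 levels deep reaches $100^4 = 10^8$ pages; at 100 kB per page, a single crawl may fetch 10 TB of data. How should it be deduplicated? Assuming 50% is removed as duplicates, 5 TB remains.
Assuming 20% of that needs periodic updates, each update cycle covers 1 TB.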
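
The estimate above, plus one simple answer to the deduplication question, can be sketched as follows. This is an illustration under stated assumptions, not a crawler design: the constants are the figures from the notes, 1 TB is taken as $10^{12}$ bytes, and `crawl_estimate` and `SeenUrls` are hypothetical names. Deduplication here is by normalized URL only (lowercased host, fragment dropped); detecting duplicate page content would need a separate content hash.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

TB = 10**12  # assumed decimal terabyte


def crawl_estimate(links_per_page=100, depth=4, page_bytes=100_000,
                   duplicate_ratio=0.5, refresh_ratio=0.2):
    """Return (pages, raw_tb, unique_tb, refresh_tb) for one crawl."""
    pages = links_per_page ** depth            # 100^4 = 10^8 pages
    raw_tb = pages * page_bytes / TB           # data fetched per crawl
    unique_tb = raw_tb * (1 - duplicate_ratio)  # left after deduplication
    refresh_tb = unique_tb * refresh_ratio      # re-fetched per update cycle
    return pages, raw_tb, unique_tb, refresh_tb


class SeenUrls:
    """In-memory set of URL digests; a stand-in for a shared store."""

    def __init__(self):
        self._digests = set()

    def add(self, url):
        """Record the URL; return True if it was new, False if seen before."""
        parts = urlsplit(url)
        normalized = urlunsplit(
            (parts.scheme, parts.netloc.lower(), parts.path or "/",
             parts.query, "")  # empty fragment: "#..." never changes the page
        )
        digest = hashlib.sha1(normalized.encode("utf-8")).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

With $10^8$ URLs and 20-byte SHA-1 digests, the raw digests alone come to about 2 GB, so in a distributed setup the set would likely be sharded across nodes by digest or replaced by a Bloom filter.
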
http://blog.sina.com.cn/s/blog_59c4ac5501017wda.html
http://www.douban.com/group/topic/38361104/
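3. Design a mobile cloud push service based on persistent (long-lived) connections. How should connections be managed (dropped connections, connection lookup), how can millions of persistent connections be supported, and how is fault tolerance achieved?
4. News feeds.
5. A distributed caching scheme.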
For system design, I think it helps to know the following:
- Horizontal scaling versus vertical scaling;
- Whether the workload is read-heavy or write-heavy;
- Load balancing;
- DNS; BIND is by far the most widely used DNS software on the Internet, providing a robust and stable platform on top of which organizations can build distributed computing systems with the knowledge that those systems are fully compliant with published DNS standards.
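- Caching, and the avalanche effect a cache system can suffer (when cached entries expire and the data has to be reloaded from the database, the burst of concurrent database queries makes responses extremely slow); one decent mitigation is double caching. At work, Redis has only been used as a "Memcached with lots of data structures" (Memcached as the first-level cache in front of the database, Redis as the second-level cache for business scenarios);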
- Nginx (pronounced "engine x"); on Linux, nginx uses the epoll event model;
- Data recovery; logs are a great help.
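
The double-caching idea from the list above can be sketched roughly as below. This is one possible reading of "double cache", not a reference implementation: a short-lived primary copy and a longer-lived backup copy are kept for each key, and while one caller reloads an expired entry from the database, other callers are served the stale backup instead of also hitting the database. The class name `DoubleCache`, the `loader` callback, and both TTL values are assumptions, and plain dictionaries stand in for Memcached or Redis.

```python
import threading
import time


class DoubleCache:
    """Primary cache with short TTL plus a backup cache with long TTL."""

    def __init__(self, loader, ttl=60.0, backup_ttl=600.0,
                 clock=time.monotonic):
        self._loader = loader          # hypothetical "read from database"
        self._ttl = ttl
        self._backup_ttl = backup_ttl
        self._clock = clock
        self._primary = {}             # key -> (value, expires_at)
        self._backup = {}              # key -> (value, expires_at)
        self._rebuilding = set()       # keys currently being reloaded
        self._lock = threading.Lock()  # guards the three containers above

    def _fresh(self, store, key):
        entry = store.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        with self._lock:
            hit = self._fresh(self._primary, key)
            if hit is not None:
                return hit[0]
            stale = self._fresh(self._backup, key)
            if stale is not None and key in self._rebuilding:
                # Another caller is already reloading: serve the backup
                # copy rather than sending one more query to the database.
                return stale[0]
            self._rebuilding.add(key)
        try:
            value = self._loader(key)  # slow call, made outside the lock
            now = self._clock()
            with self._lock:
                self._primary[key] = (value, now + self._ttl)
                self._backup[key] = (value, now + self._backup_ttl)
            return value
        finally:
            with self._lock:
                self._rebuilding.discard(key)
```

A known limitation of this sketch: on a completely cold key (no backup copy yet), concurrent callers all fall through to `loader`, so it softens expiry storms but not cold starts.
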
A load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. Load balancers are used to increase capacity (concurrent users) and reliability of applications. They improve the overall performance of applications by decreasing the burden on servers associated with managing and maintaining application and network sessions, as well as by performing application-specific tasks.
Load balancers are generally grouped into two categories: Layer 4 and Layer 7. Layer 4 load balancers act upon data found in network and transport layer protocols (IP, TCP, UDP). Layer 7 load balancers distribute requests based upon data found in application layer protocols such as HTTP.
Requests are received by both types of load balancers and they are distributed to a particular server based on a configured algorithm. Some industry standard algorithms are:
- Round robin
- Weighted round robin
- Least connections
- Least response time

Layer 7 load balancers can further distribute requests based on application specific data such as HTTP headers, cookies, or data within the application message itself, such as the value of a specific parameter.

Load balancers ensure reliability and availability by monitoring the "health" of applications and only sending requests to servers and applications that can respond in a timely manner.
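
Three of the selection algorithms named above can be illustrated in a few lines each. These are toy, single-threaded sketches with hypothetical class names, not how any particular load balancer implements them; the weighted variant simply repeats each server according to its weight, which is the naive form of the algorithm, and health checking is left out.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    """Round robin where a server with weight w appears w times per cycle."""

    def __init__(self, weights):
        # weights: dict mapping server name -> positive integer weight
        expanded = [s for s, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest open connections."""

    def __init__(self, servers):
        self._active = {s: 0 for s in servers}

    def acquire(self):
        # On ties, min() keeps the first server in insertion order.
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server):
        self._active[server] -= 1
```

Round robin suits servers of equal capacity handling requests of similar cost; weights account for unequal capacity; least connections adapts better when request durations vary widely, at the cost of tracking per-server state.
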