๋ชฉ์ฐจCrawlingํ์ํจํค์ง๋ผ์ด๋ธ๋ฌ๋ฆฌ ์ค์น (๋๋ถ๋ถ ์ค์น๋์ด์๋ค๋ ๊ฐ์ ํ)URL(Uniform Resource Locator)HTTP(Hypertext Transfer Protocol)HTTP request ex 1HTTP request ex 2HTTP Response ex 1HTTP Response ex 2HTTP ์ฒ๋ฆฌ๋ฐฉ์์ํ ์ฝ๋requetsGET ๋ฐฉ์์ผ๋ก parameter ์ ๋ฌํ๋ ๋ฐฉ๋ฒPOST ์์ฒญํ ๋ data ์ ๋ฌ๋ฒํค๋ ์ถ๊ฐ, ์ฟ ํค ์ถ๊ฐ์ธ์ฆ์ถ๊ฐrequests๋ฅผ ์ด์ฉํ ํฌ๋กค๋งํน์ ํ์ด์ง์ ์์ค์ฝ๋๋ฅผ ํ์ผ๋ก ์ ์ฅBeautifulSoupSelector์ฐ์ต๋ฌธ์ ๋ผ์ด๋ธ๋ฌ๋ฆฌ ์ค์น (๋๋ถ๋ถ ์ค์น๋์ด์๋ค๋ ๊ฐ์ ํ)๋ฌธ์ 1๋ฒ๋ฌธ์ 2๋ฒ๋ฌธ์ 3๋ฒ
๋ชฉ์ฐจ
Crawling
์น ํ์ด์ง์ ์ ์ํด์ ์ ๋ณด๋ฅผ ์ฐพ๋ ๊ณผ์ ์ ํ๋ก๊ทธ๋จ์ ํตํด ์ฐพ์ ์์งํ๊ณ ์ํ๋ ํํ์ ๋ง๊ฒ ๊ฐ๊ณตํ๋ ๋ชจ๋ ๊ณผ์ .
- ์ฌ์ดํธ์ ์ด์์์ ์์ฌ์ ๋ฐํ์ง ์์ผ๋ฉด ํฉ๋ฒ์ด๊ณ ๊ทธ๋ ์ง ์์ผ๋ฉด ๋ถ๋ฒ
- ์ฌ์ดํธ ๋๋ ํ ๋ฆฌ์ robots.txtํ์ผ์ ๋ณด๋ฉด ํฌ๋กค๋ง์ ๊ธ์งํ๋์ง ์ํ๋์งํ์๋์ด์์ (Disallow๋ผ๋ ํ์ ์์ผ๋ฉด ํฌ๋กค๋งํ๋ฉด ์ ๋จ)
- ์นํ์ด์ง ์์ค ์ค ์น ํ๋ก๊ทธ๋๋ฐ ์์๋ ์ ์๋ฌผ๋ก ์ธ์ ๋ ์ ์์ผ๋ฏ๋ก ๋ถ๋ฒ ๋ณต์ ๋ ์ ์๊ถ ์นจํด์ ํด๋น.
ํ์ํจํค์ง
- (ํ์) pip3 install BeautifulSoup4 or pip3 install bs4
- (ํ์) pip3 install requests
- (ํ์) pip3 install pandas
- (ํ์) pip3 install plotly
- (์ ํ) pip3 install lxml
๋ผ์ด๋ธ๋ฌ๋ฆฌ ์ค์น (๋๋ถ๋ถ ์ค์น๋์ด์๋ค๋ ๊ฐ์ ํ)
- !pip3 install requests
- !pip3 install beautifulsoup4
!pip3 install requests
!pip3 install beautifulsoup4
# macOS, Linux: !ls
# Windows: !dir
Out[-] C ๋๋ผ์ด๋ธ์ ๋ณผ๋ฅจ์๋ ์ด๋ฆ์ด ์์ต๋๋ค. ๋ณผ๋ฅจ ์ผ๋ จ ๋ฒํธ: CC5E-6766 C:\Users\leehojun\Google ๋๋ผ์ด๋ธ\11_1. ์ฝํ ์ธ ๋์์ ๊ฒฐ๊ณผ๋ฌผ\007. ํฌ๋กค๋ง ๊ฐ์ ๋๋ ํฐ๋ฆฌ 2020-04-03 13:16 <DIR> . 2020-04-03 13:16 <DIR> .. 2020-04-03 13:10 <DIR> .ipynb_checkpoints 2020-04-03 12:18 391,939 001.ipynb 2020-04-03 01:58 <DIR> ์ฐธ๊ณ ์๋ฃ 2020-04-03 13:16 999 ์ต์ข ๊ฐ์์๋ฃ_ํฌ๋กค๋ง.ipynb 2๊ฐ ํ์ผ 392,938 ๋ฐ์ดํธ 4๊ฐ ๋๋ ํฐ๋ฆฌ 13,425,782,784 ๋ฐ์ดํธ ๋จ์
URL(Uniform Resource Locator)
- ์์์ด ์ด๋ ์๋์ง๋ฅผ ์๋ ค์ฃผ๊ธฐ ์ํ ๊ท์ฝ
- ํํ ์น ์ฌ์ดํธ ์ฃผ์๋ก ์๊ณ ์์ง๋ง, URL์ ์น ์ฌ์ดํธ ์ฃผ์๋ฟ๋ง ์๋๋ผ ์ปดํจํฐ ๋คํธ์ํฌ์์ ์์์ ๋ชจ๋ ๋ํ๋ผ ์ ์์
- ๊ทธ ์ฃผ์์ ์ ์ํ๋ ค๋ฉด ํด๋น URL์ ๋ง๋ ํ๋กํ ์ฝ์ ์์์ผ ํ๊ณ , ๊ทธ์ ๋์ผํ ํ๋กํ ์ฝ๋ก ์ ์(FTP ํ๋กํ ์ฝ์ธ ๊ฒฝ์ฐ์๋ FTP ํด๋ผ์ด์ธํธ๋ฅผ ์ด์ฉํด์ผ ํ๊ณ , HTTP์ธ ๊ฒฝ์ฐ์๋ ์น ๋ธ๋ผ์ฐ์ ๋ฅผ ์ด์ฉํด์ผ ํ๋ค. ํ ๋ท์ ๊ฒฝ์ฐ์๋ ํ ๋ท ํ๋ก๊ทธ๋จ์ ์ด์ฉํด์ ์ ์)
์ถ์ฒ : Wiki
HTTP(Hypertext Transfer Protocol)
- Carries HTML, XML, JavaScript, audio, video, images, PDF, etc.
- Message structure: request or status line / headers (optional) / blank line (end of headers) / body (optional)
HTTP request ex 1
GET /stock.html HTTP/1.1
Host: www.paullab.co.kr
HTTP request ex 2
GET /index.html HTTP/1.1
user-agent: MSIE 6.0; Windows NT 5.0
accept: text/html; */*
cookie: name = value
referer: http://www.naver.com
host: www.paullab.co.kr
- ๋ฐ์ดํฐ ์ฒ๋ฆฌ ๋ฐฉ์, ๊ธฐ๋ณธ ํ์ด์ง, ํ๋กํ ์ฝ ๋ฒ์ .
- User-Agent: ์ฌ์ฉ์ ์น ๋ธ๋ผ์ฐ์ ์ข ๋ฅ ๋ฐ ๋ฒ์ ์ ๋ณด.
- Accept: ์น ์๋ฒ๋ก๋ถํฐ ์์ ๋๋ ๋ฐ์ดํฐ ์ค ์น ๋ธ๋ผ์ฐ์ ๊ฐ ์ฒ๋ฆฌํ ์ ์๋ ๋ฐ์ดํฐ ํ์ ์ ์๋ฏธ.
์ฌ๊ธฐ์ text/html์ text, html ํํ์ ๋ฌธ์๋ฅผ ์ฒ๋ฆฌํ ์ ์๊ณ ,ย /๋ ๋ชจ๋ ๋ฌธ์๋ฅผ ์ฒ๋ฆฌํ ์ ์๋ค๋ ์๋ฏธ. (์ด๋ฅผ MIME ํ์
์ด๋ผ ๋ถ๋ฅด๊ธฐ๋ ํ๋ค.)
- Cookie: HTTP ํ๋กํ ์ฝ ์์ฒด๊ฐ ์ธ์ ์ ์ ์งํ์ง ์๋ State-less(์ ์์ํ๋ฅผ ์ ์งํ์ง ์๋) ๋ฐฉ์์ด๊ธฐ ๋๋ฌธ์ ๋ก๊ทธ์ธ ์ธ์ฆ์ ์ํ ์ฌ์ฉ์ ์ ๋ณด๋ฅผ ๊ธฐ์ตํ๋ ค๊ณ ๋ง๋ ์ธ์์ ์ธ ๊ฐ. ์ฆ ์ฌ์ฉ์๊ฐ ์ ์์ ์ธ ๋ก๊ทธ์ธ ์ธ์ฆ ์ ๋ณด๋ฅผ ๊ฐ์ง๊ณ ์๋ค๋ ๊ฒ์ ํ๋จํ๊ณ ์ ์ฌ์ฉ.
- Referer: ํ์ฌ ํ์ด์ง ์ ์ ์ ์ ์ด๋ ์ฌ์ดํธ๋ฅผ ๊ฒฝ์ ํ๋์ง ์๋ ค์ฃผ๋ ๋๋ฉ์ธ ํน์ URL ์ ๋ณด.
- Host: ์ฌ์ฉ์๊ฐ ์์ฒญํ ๋๋ฉ์ธ์ ๋ณด.
HTTP Response ex 1
HTTP/1.1 200 OK                                        ## status line
Content-Type: application/xhtml+xml; charset=utf-8     ## headers
                                                       ## blank line
<html>                                                 ## body
...
</html>
HTTP Response ex 2
HTTP/1.1 200 OK
Server: NCSA/1.4.2
Content-type: text/html
Content-length: 107

<html>
...
</html>
- ์น ํ๋กํ ์ฝ ๋ฒ์ ๋ฐ ์๋ต ์ฝ๋ ์ ๋ณด๊ฐ ํฌํจ.
- ์น ์ ํ๋ฆฌ์ผ์ด์ ์ข ๋ฅ ๋ฐ ๋ฒ์ ์ ๋ณด๊ฐ ํฌํจ.
- MIME ํ์ ์ ๋ณด๊ฐ ํฌํจ.
- ์์ Body ์ฌ์ด์ฆ ์ ๋ณด๊ฐ ํฌํจ.
- ์ฌ์ฉ์๊ฐ ์์ฒญํ ์น ํ์ด์ง ์ ๋ณด๊ฐ ํฌํจ.
HTTP ์ฒ๋ฆฌ๋ฐฉ์
- GET: retrieve a resource (parameters appended after `?` — relatively small payloads)
- POST: create a resource (data carried in the body — relatively large payloads)
- PUT: request modification of a resource
- DELETE: request deletion of a resource
- HEAD: request only the HTTP headers, to check whether a resource exists
- OPTIONS: ask which methods the web server supports
- TRACE: trace the path a request takes to the destination
- CONNECT: open a two-way connection to the requested resource
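requests has a helper for each of these verbs (requests.get, requests.post, requests.put, requests.delete, ...). A prepared request shows what would be sent without any network traffic; the `/resource` path below is made up for illustration:

```python
import requests

# build and prepare a request without sending it
req = requests.Request('DELETE', 'http://www.paullab.co.kr/resource').prepare()
print(req.method)  # DELETE
print(req.url)     # http://www.paullab.co.kr/resource
```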
์ํ ์ฝ๋
- 200: the server handled the request successfully.
- 201: the request succeeded and the server created a new resource.
- 202: the server accepted the request but has not yet processed it.
- 301: the requested page has moved permanently to a new location.
- 403: the server refused the request.
- 404: the server cannot find the requested page.
- 500: an error occurred on the server and the request could not be fulfilled.
- 503: the server is currently unavailable because it is overloaded or down for maintenance.
์ถ์ฒ : WIKI
ย
import requests
import bs4
requests.__version__ # check the requests version
Out[-] '2.22.0'
bs4.__version__ # check the bs4 version
Out[-] '4.7.1'
from datetime import datetime
datetime.now() # print the current time
Out[-] datetime.datetime(2020, 4, 3, 13, 36, 59, 789815)
requests
- HTTP ์์ฒญ์ ๋ณด๋ด๋๋ฐ ์ฌ์ฉํ๋ ๋ผ์ด๋ธ๋ฌ๋ฆฌ
- .text : strํ์ ์ ๋ฐ์ดํฐ๋ฅผ return
- .headers : header(key/value ํ์์ผ๋ก ๋ฐ์ดํฐ ์ ์ฅ)์ ๋ด์ฉ ํ์ธ
- .encoding : ์ธ์ฝ๋ฉ ๋ฐฉ์ ํ์ธ
- .status_code : HTTP ์์ฒญ์ ๋ํด์ ์์ฒญ์ด ์ฑ๊ณตํ๋์ง ์คํจํ๋์ง ํน์ ์ด๋ค ์ํ์ธ์ง ๋งํด์ค
- .ok : ๋ฐ์ดํฐ๋ฅผ ์ ๋ถ๋ฌ์ค๊ณ ์๋์ง ํ์ธ
import requests

html = requests.get('http://www.paullab.co.kr/stock.html')
html
Out[-] <Response [200]> # the server handled the request successfully
html.text # the Korean text comes out garbled (mojibake)
Out[-] '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\n <meta http-equiv="X-UA-Compatible" content="ie=edge">\n <title>Document</title>\n <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css">\n <style>\n h1{\n margin: 2rem;\n }\n h1>span{\n font-size: 1rem;\n }\n .main {\n width: 70%;\n margin: 3rem auto auto auto;\n text-align: center\n }\n\n table {\n width: 100%;\n }\n </style>\n</head>\n\n<body>\n <h1>รญ\x81ยฌรซยกยครซยง\x81 รฌ\x97ยฐรฌ\x8aยตรฌ\x9aยฉ รญ\x8e\x98รฌ\x9dยดรฌยง\x80 ... <td class="num"><span>139,085</span></td>\n </tr>\n </tbody>\n </table>\n </div>\n</body>\n\n</html>\n'
html.headers
Out[-] {'Server': 'nginx', 'Date': 'Sat, 04 Apr 2020 11:13:06 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'P3P': "CP='NOI CURa ADMa DEVa TAIa OUR DELa BUS IND PHY ONL UNI COM NAV INT DEM PRE'", 'X-Powered-By': 'PHP/5.5.17p1', 'Content-Encoding': 'gzip'}
dir(html)
Out[-] ['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', ... 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
# ISO-8859-1: an ASCII-based extended encoding
html.encoding
Out[-] 'ISO-8859-1'
html.encoding = 'utf-8' # now the Korean text renders correctly
html.text
Out[-] '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\n <meta http-equiv="X-UA-Compatible" content="ie=edge">\n <title>Document</title>\n <link rel="stylesheet" href= "https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"> \n <style>\n h1{\n margin: 2rem;\n }\n h1>span{\n font-size: 1rem;\n }\n .main {\n width: 70%;\n margin: 3rem auto auto auto;\n text-align: center\n }\n\n table {\n width: 100%;\n }\n </style>\n</head>\n\n<body>\n <h1>ํฌ๋กค๋ง ์ฐ์ต์ฉ ํ์ด์ง ... <td class="num"><span>139,085</span></td>\n </tr>\n </tbody>\n </table>\n </div>\n</body>\n\n</html>\n'
html.status_code
Out[-] 200 # 200: success
html.ok
Out[-] True
ย
GET ๋ฐฉ์์ผ๋ก parameter ์ ๋ฌํ๋ ๋ฐฉ๋ฒ
<html>
<head>
</head>
<body>
  <form action="test.html" method="GET">
    <input type="text" name="user_id">
    <input type="password" name="user_pw">
    <input type="submit" name="submit">
  </form>
</body>
</html>
params = {'pa1': 'val1', 'pa2': 'value2'}
response = requests.get('http://www.paullab.co.kr', params=params)
response.url
Out[-] 'http://www.paullab.co.kr/?pa1=val1&pa2=value2'
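The query string requests builds here is the same one the standard library's `urlencode` produces, so the URL above can be reproduced offline:

```python
from urllib.parse import urlencode

params = {'pa1': 'val1', 'pa2': 'value2'}
query = urlencode(params)  # dicts preserve insertion order in Python 3.7+
url = 'http://www.paullab.co.kr/?' + query
print(url)  # http://www.paullab.co.kr/?pa1=val1&pa2=value2
```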
ย
POST ์์ฒญํ ๋ data ์ ๋ฌ๋ฒ
import requests, json

data = {'pa1': 'val1', 'pa2': 'value2'}
response = requests.post('http://www.paullab.co.kr', data=json.dumps(data))
ํค๋ ์ถ๊ฐ, ์ฟ ํค ์ถ๊ฐ
headers = {'Content-Type': 'application/json; charset=utf-8'}
cookies = {'session_id': 'sorryidontcare'}
response = requests.get('http://www.paullab.co.kr', headers=headers, cookies=cookies)
์ธ์ฆ์ถ๊ฐ
response = requests.post('http://www.paullab.co.kr', auth=("id","pass"))
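With `auth=("id", "pass")`, requests sends HTTP Basic authentication, which is just `id:password` base64-encoded into an Authorization header. A sketch of what ends up on the wire, using the placeholder credentials from the line above:

```python
import base64

# what requests' auth=("id", "pass") turns into on the wire
credentials = base64.b64encode(b'id:pass').decode('ascii')
header = f'Authorization: Basic {credentials}'
print(header)  # Authorization: Basic aWQ6cGFzcw==
```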
ย
requests๋ฅผ ์ด์ฉํ ํฌ๋กค๋ง
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser') # parse the HTML string into a navigable tree
print(soup.prettify()) # pretty-print as an HTML document
Out[-] <!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"/> <meta content="width=device-width, initial-scale=1.0" name="viewport"/> <meta content="ie=edge" http-equiv="X-UA-Compatible"/> <title> Document </title> ... <td class="num"> <span> 139,085 </span> </td> </tr> </tbody> </table> </div> </body> </html>
ย
ํน์ ํ์ด์ง์ ์์ค์ฝ๋๋ฅผ ํ์ผ๋ก ์ ์ฅ
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text

# save the page source to a file
f = open('test.html', 'w', encoding='utf-8')
f.write(html)
f.close()
!dir
Out[-] C ๋๋ผ์ด๋ธ์ ๋ณผ๋ฅจ์๋ ์ด๋ฆ์ด ์์ต๋๋ค. ๋ณผ๋ฅจ ์ผ๋ จ ๋ฒํธ: CC5E-6766 C:\Users\leehojun\Google ๋๋ผ์ด๋ธ\11_1. ์ฝํ ์ธ ๋์์ ๊ฒฐ๊ณผ๋ฌผ\007. ํฌ๋กค๋ง ๊ฐ์ ๋๋ ํฐ๋ฆฌ 2020-04-03 13:55 <DIR> . 2020-04-03 13:55 <DIR> .. 2020-04-03 13:10 <DIR> .ipynb_checkpoints 2020-04-03 13:50 392,338 001.ipynb 2020-04-03 13:55 48,527 test.html # test.html์ด ์์ฑ๋๋๊ฒ์ ํ์ธ 2020-04-03 01:58 <DIR> ์ฐธ๊ณ ์๋ฃ 2020-04-03 13:54 221,527 ์ต์ข ๊ฐ์์๋ฃ_ํฌ๋กค๋ง.ipynb 3๊ฐ ํ์ผ 662,392 ๋ฐ์ดํธ 4๊ฐ ๋๋ ํฐ๋ฆฌ 13,301,800,960 ๋ฐ์ดํธ ๋จ์
# url ํ์ผ์์ ํน์ ๋จ์ด ์ฐพ๊ธฐ s = html.split(' ') # ๋์ด์ฐ๊ธฐ ๋จ์๋ก ๋ถํ # ์๋ค๋ก ๋์ด์ฐ๊ธฐ ์๋์ด ์์ผ๋ฉด ๊ฒ์์ด ์๋จ word = input('ํ์ด์ง์์ ๊ฒ์ํ ๋จ์ด๋ฅผ ์ ๋ ฅํ์ธ์ : ') s.count(word)
Out[-] ํ์ด์ง์์ ๊ฒ์ํ ๋จ์ด๋ฅผ ์ ๋ ฅํ์ธ์ : ์ ์ฃผ 0
ย
BeautifulSoup
- strํ์ ์ html ๋ฐ์ดํฐ๋ฅผ html ๊ตฌ์กฐ๋ฅผ ๊ฐ์ง ๋ฐ์ดํฐ๋ก ๊ฐ๊ณตํด์ฃผ๋ ๋ผ์ด๋ธ๋ฌ๋ฆฌ
- BeautifulSoup(markup, "html.parser")
- BeautifulSoup(markup, "lxml")
- BeautifulSoup(markup, "lxml-xml")
- BeautifulSoup(markup, "xml")
- BeautifulSoup(markup, "html5lib")
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
soup.title # the title tag
Out[-] <title>Document</title>
soup.title.string # only the string inside the title tag
Out[-] 'Document'
soup.title.text # same result as .string here
Out[-] 'Document'
soup.title.parent.name # name of the parent tag
Out[-] 'head'
soup.tr # table row
Out[-] <tr> <th scope="col">๋ ์ง</th> <th scope="col">์ข ๊ฐ</th> <th scope="col">์ ์ผ๋น</th> <th scope="col">์๊ฐ</th> <th scope="col">๊ณ ๊ฐ</th> <th scope="col">์ ๊ฐ</th> <th scope="col">๊ฑฐ๋๋</th> </tr>
soup.td # table data
Out[-] <td align="center "><span class="date">2019.10.23</span></td>
soup.th # table header cell
Out[-] <th scope="col">๋ ์ง</th>
soup.table
Out[-] <table class="table table-hover"> <tbody> <tr> <th scope="col">๋ ์ง</th> <th scope="col">์ข ๊ฐ</th> <th scope="col">์ ์ผ๋น</th> <th scope="col">์๊ฐ</th> <th scope="col">๊ณ ๊ฐ</th> <th scope="col">์ ๊ฐ</th> <th scope="col">๊ฑฐ๋๋</th> </tr> <tr> <td align="center "><span class="date">2019.10.23</span></td> <td class="num"><span>6,650</span></td> ... </td> <td class="num"><span>5,300</span></td> <td class="num"><span>5,370</span></td> <td class="num"><span>5,280</span></td> <td class="num"><span>211,019</span></td> </tr> </tbody> </table>
soup.find('title') # find(): returns the first tag that matches the condition
Out[-] <title>Document</title>
soup.find('tr')
Out[-] <tr> <th scope="col">๋ ์ง</th> <th scope="col">์ข ๊ฐ</th> <th scope="col">์ ์ผ๋น</th> <th scope="col">์๊ฐ</th> <th scope="col">๊ณ ๊ฐ</th> <th scope="col">์ ๊ฐ</th> <th scope="col">๊ฑฐ๋๋</th> </tr>
soup.find('th')
Out[-] <th scope="col">๋ ์ง</th>
soup.find(id=('update')).text # text of the element with a given id
Out[-] 'update : 20.12.30'
soup.find('head').find('title') # the title inside head
Out[-] <title>Document</title>
soup.find('h2', id='제주코딩베이스캠프연구소') # the h2 whose id is '제주코딩베이스캠프연구소'
Out[-] <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ์ฐ๊ตฌ์">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ฐ๊ตฌ์</h2>
soup.find_all('h2') # find_all(): returns every tag that matches the condition
Out[-] [<h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ์ฐ๊ตฌ์">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ฐ๊ตฌ์</h2>, <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ๊ณต์ ">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ๊ณต์ </h2>, <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ์ถํ์ฌ">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ถํ์ฌ</h2>, <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํํ์">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ํ์</h2>]
soup.find_all('h2')[0]
Out[-] <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ์ฐ๊ตฌ์">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ฐ๊ตฌ์</h2>
soup.find_all('table', class_='table')
# class_ ends with an underscore because 'class' is a reserved word in Python
# (a reserved word is one set aside by the language for a specific purpose)
Out[-] [<table class="table table-hover"> <tbody> <tr> <th scope="col">๋ ์ง</th> <th scope="col">์ข ๊ฐ</th> <th scope="col">์ ์ผ๋น</th> <th scope="col">์๊ฐ</th> <th scope="col">๊ณ ๊ฐ</th> <th scope="col">์ ๊ฐ</th> <th scope="col">๊ฑฐ๋๋</th> </tr> <tr> <td align="center "><span class="date">2019.10.23</span></td> <td class="num"><span>6,650</span></td> ... </td> <td class="num"><span>2,020</span></td> <td class="num"><span>2,090</span></td> <td class="num"><span>2,020</span></td> <td class="num"><span>139,085</span></td> </tr> </tbody> </table>]
soup = BeautifulSoup('''
<hojun id='jeju' class='codingBaseCamp codingLevelUp'>
  hello world
</hojun>
''', 'html.parser')
# tag = hojun, id = 'jeju', class = 'codingBaseCamp codingLevelUp'
tag = soup.hojun
tag
Out[-] <hojun class="codingBaseCamp codingLevelUp" id="jeju"> hello world </hojun>
type(tag)
Out[-] bs4.element.Tag
dir(tag) # methods and attributes of the Tag object
Out[-] ['HTML_FORMATTERS', 'XML_FORMATTERS', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', ... 'setup', 'string', 'strings', 'stripped_strings', 'text', 'unwrap', 'wrap']
tag.name
Out[-] 'hojun'
tag['class']
Out[-] ['codingBaseCamp', 'codingLevelUp']
tag['id']
Out[-] 'jeju'
tag.attrs # view all attributes at once
Out[-] {'id': 'jeju', 'class': ['codingBaseCamp', 'codingLevelUp']}
tag.string # the string content
Out[-] '\n hello world\n'
tag.text # the string content (same here)
Out[-] '\n hello world\n'
tag.contents # children as a list
Out[-] ['\n hello world\n']
for i in tag.children:  # children: the tag's direct child nodes
    print(i)
Out[-] hello world
tag.children
Out[-] <list_iterator at 0x1cd52f37550>
soup = BeautifulSoup('''
<ul>
  <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
  <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
  <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
</ul>
''', 'html.parser')
# note: duplicate ids are invalid HTML; used here only for demonstration
tag = soup.ul
tag
Out[-] <ul> <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li> <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li> <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li> </ul>
tag.contents # list
Out[-] ['\n', <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>, '\n', <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>, '\n', <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>, '\n']
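The '\n' entries above are whitespace text nodes. The `stripped_strings` generator iterates over the real text only; a small sketch on a literal snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul>\n<li>a</li>\n<li>b</li>\n</ul>', 'html.parser')
print(list(soup.ul.strings))           # includes the '\n' whitespace text nodes
print(list(soup.ul.stripped_strings))  # ['a', 'b']
```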
tag.li # the first li tag
Out[-] <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
tag.li.parent # the parent tag of the li
Out[-] <ul> <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li> <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li> <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li> </ul>
ย
Selector
- ํ๊ทธ์ ์ข ๋ ์ธ๋ฐํ ์ ๊ทผ์ด ๊ฐ๋ฅ
- class๋ฅผ ์ง์นญํ ๋๋ '.'์ ์ฌ์ฉํ๊ณ , id๋ฅผ ์ง์นญํ ๋๋ '#'๋ฅผ ์ฌ์ฉ
- ํ์ํ๊ณ ์ ํ๋ ํ๊ทธ๊ฐ ํน์ ํ๊ทธ ํ์์ ์์ ๋ '>'๋ฅผ ์ฌ์ฉ
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
soup.select('#update')
Out[-] [<span id="update">update : 20.12.30</span>]
soup.select('.table > tr')  # every tr that is a direct child of the .table element
# actual structure: table > tbody > tr (tr is not a direct child, so this fails)
Out[-] []
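The empty list happens because '>' demands a direct child while tr actually sits under tbody. Dropping '>' for the descendant combinator (a space) matches at any depth; a self-contained sketch:

```python
from bs4 import BeautifulSoup

html = '<table class="table"><tbody><tr><td>1</td></tr></tbody></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('.table > tr'))     # []: tr is a grandchild, not a child
print(len(soup.select('.table tr')))  # 1: the descendant combinator matches
```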
soup.select('.table > tbody > tr')[2]  # all tr tags that are direct children of tbody inside .table
Out[-] <tr> <td align="center"><span class="date">2019.10.22</span></td> <td class="num"><span>6,630</span></td> <td class="num"> <img alt="ํ๋ฝ" height="6" src="ico_down.gif" style="margin-right:4px;" width="7"/> <span class="tah p11 nv01"> 190 </span> </td> <td class="num"><span>6,830</span></td> <td class="num"><span>6,930</span></td> <td class="num"><span>6,530</span></td> <td class="num"><span>919,571</span></td> </tr>
# ์์ ์ ํ ๋ฐฉ๋ฒ soup.select("p > a:nth-of-type(2)") # p > a tag ์ธ๋ฐ 2๋ฒ์งธ ์์ soup.select("p > a:nth-child(even)") # p > a tag ์ธ๋ฐ ์ง์๋ ํ์๋ฒ์งธ ์์ soup.select('a[href]') # ํน์ attribute ์์ soup.select("#link1 + .sister") # id์ class๋ฅผ ๋์์ ๊ฐ์ง ์์
oneStep = soup.select('.main')[0]  # 제주코딩베이스캠프 연구소
oneStep
Out[-] <div class="main"> <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ์ฐ๊ตฌ์">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ฐ๊ตฌ์</h2> <h3><span style="color: salmon">์ผ๋ณ</span> ์์ธ</h3> <table class="table table-hover"> <tbody> <tr> <th scope="col">๋ ์ง</th> <th scope="col">์ข ๊ฐ</th> <th scope="col">์ ์ผ๋น</th> <th scope="col">์๊ฐ</th> <th scope="col">๊ณ ๊ฐ</th> <th scope="col">์ ๊ฐ</th> <th scope="col">๊ฑฐ๋๋</th> </tr> ... <td class="num"><span>5,300</span></td> <td class="num"><span>5,370</span></td> <td class="num"><span>5,280</span></td> <td class="num"><span>211,019</span></td> </tr> </tbody> </table> </div>
twoStep = oneStep.select('tbody > tr')[1:]
twoStep
Out[-] <tr> <td align="center "><span class="date">2019.10.23</span></td> <td class="num"><span>6,650</span></td> <td class="num"> <img alt="์์น " height="6 " src="ico_up.gif " style="margin-right:4px; " width="7 "/> <span> 20 </span> ... 10 </span> </td> <td class="num"><span>5,300</span></td> <td class="num"><span>5,370</span></td> <td class="num"><span>5,280</span></td> <td class="num"><span>211,019</span></td> </tr>]
twoStep[0].select('td')[0].text # date
Out[-] '2019.10.23'
twoStep[0].select('td')[1].text # closing price
Out[-] '6,650' # this is a string; convert it to a number before doing arithmetic:
# int(twoStep[0].select('td')[1].text.replace(',', ''))
๋ ์ง = [] ์ข ๊ฐ = [] for i in twoStep: ๋ ์ง.append(i.select('td')[0].text) ์ข ๊ฐ.append(int(i.select('td')[1].text.replace(',', '')))
๋ ์ง
Out[-] ['2019.10.23', '2019.10.22', '2019.10.21', '2019.10.18', '2019.10.17', '2019.10.16', '2019.10.15', '2019.10.14', '2019.10.11', '2019.10.10', '2019.10.08', '2019.10.07', '2019.10.04', '2019.10.02', '2019.10.01', '2019.09.30', '2019.09.27', '2019.09.26', '2019.09.25', '2019.09.24']
์ข ๊ฐ
Out[-] [6650, 6630, 6820, 6430, 5950, 5930, 5640, 5380, 5040, 5100, 5050, 4940, 5010, 4920, 5010, 5000, 5010, 5060, 5060, 5330]
# ์๊ฐํ # ๋ ์ง๋ณ๋ก ๊ฐ๊ฒฉ ๋ณ๋ ์ถ์ด import plotly.express as px fig = px.line(x=๋ ์ง, y=์ข ๊ฐ, title='jejucodingcamp') fig.show()
Out[-]
ย
์ฐ์ต๋ฌธ์
๋ผ์ด๋ธ๋ฌ๋ฆฌ ์ค์น (๋๋ถ๋ถ ์ค์น๋์ด์๋ค๋ ๊ฐ์ ํ)
- !pip3 install requests
- !pip3 install beautifulsoup4
- ํฌ๋กค๋ง URL :ย http://www.paullab.co.kr/stock.html
๋ฌธ์ 1๋ฒ
๊ฐ ํ์ฌ๋ณ 1๋ง์ฃผ์ฉ ์๋ค๊ณ ๊ฐ์ ํ์ ๋, ์ ๊ทธ๋ฃน์ฌ ์๊ฐ์ด์ก์ ๊ตฌํด์ฃผ์ธ์.
- ๊ทธ๋ฃน์ฌ : [ ์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ฐ๊ตฌ์, ์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ๊ณต์ , ์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ์ถํ์ฌ, ์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ํ์]
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.paullab.co.kr/stock.html")
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
soup.select('.main')[0]  # 제주코딩베이스캠프 연구소
soup.select('.main')[1]  # 제주코딩베이스캠프 공제
soup.select('.main')[2]  # 제주코딩베이스캠프 출판사
soup.select('.main')[3]  # 제주코딩베이스캠프 회사
Out[-] <div class="main"> <h2 id="์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํํ์">์ ์ฃผ์ฝ๋ฉ๋ฒ ์ด์ค์บ ํ ํ์</h2> <h3><span style="color: salmon">์ผ๋ณ</span> ์์ธ</h3> <table class="table table-hover"> <tbody> <tr> <th scope="col">๋ ์ง</th> <th scope="col">์ข ๊ฐ</th> <th scope="col">์ ์ผ๋น</th> <th scope="col">์๊ฐ</th> <th scope="col">๊ณ ๊ฐ</th> <th scope="col">์ ๊ฐ</th> <th scope="col">๊ฑฐ๋๋</th> </tr> ... <td class="num"><span>2,020</span></td> <td class="num"><span>2,090</span></td> <td class="num"><span>2,020</span></td> <td class="num"><span>139,085</span></td> </tr> </tbody> </table> </div>
๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ = soup.select('.main') ์ค๋์ข ๊ฐ = [] ์ค๋์๊ฐ์ด์ก = [] for i in ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ: print(i.select('.table > tbody > tr')[1].select('td')[1]) print(i.select('.table > tbody > tr')[1].select('td')[1].text) print(i.select('.table > tbody > tr')[1].select('td')[1].text.replace(',', ''))
Out[-] <td class="num"><span>6,650</span></td>  # Oct 23, 제주코딩베이스캠프 연구소, close 6,650
6,650
6650
<td class="num"><span>31,300</span></td>  # Oct 23, 제주코딩베이스캠프 공제, close 31,300
31,300
31300
<td class="num"><span>13,250</span></td>  # Oct 23, 제주코딩베이스캠프 출판사, close 13,250
13,250
13250
<td class="num"><span>2,600</span></td>  # Oct 23, 제주코딩베이스캠프 회사, close 2,600
2,600
2600
๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ = soup.select('.main') ์ค๋์ข ๊ฐ = [] ์ค๋์๊ฐ์ด์ก = [] for i in ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ: ์ค๋์ข ๊ฐ.append(int(i.select('.table > tbody > tr')[1].select('td')[1]. select('td > span')[0].text.replace(',', ''))) print(์ค๋์ข ๊ฐ)
Out[-] [6650, 31300, 13250, 2600]
์ค๋์๊ฐ์ด์ก = [i*10000 for i in ์ค๋์ข ๊ฐ] ์ ๊ทธ๋ฃน์ฌ์๊ฐ์ด์ก = format(sum(์ค๋์๊ฐ์ด์ก), ',') ์ ๊ทธ๋ฃน์ฌ์๊ฐ์ด์ก
Out[-] '538,000,000'
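Since Python 3.6 the same thousands separator is available as an f-string format spec. A quick offline check, using the closing prices from the output above:

```python
오늘종가 = [6650, 31300, 13250, 2600]           # closing prices from the page
total = sum(가격 * 10000 for 가격 in 오늘종가)   # 10,000 shares per company
print(f'{total:,}')  # 538,000,000
```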
ย
๋ฌธ์ 2๋ฒ
์ ๊ทธ๋ฃน์ฌ ์๊ฐ์ด์ก ์ถ์ด๋ฅผ ๊ทธ๋ํ๋ก ๊ทธ๋ ค์ฃผ์ธ์. x์ถ์ ๋ ์ง, y์ถ์ ๊ฐ๊ฒฉ์
๋๋ค.
# ๊ฐ๊ทธ๋ฃน์ฌ์ ์ผ์ผ ์๊ฐ์ด์ก์ ๊ตฌํจ ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ = soup.select('.main') ์ค๋์ข ๊ฐ = [] ์ค๋์๊ฐ์ด์ก = [] for i in ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ: ์ค๋์ข ๊ฐ.append(int(i.select('.table > tbody > tr')[1].select('td')[1]. select('td > span')[0].text.replace(',', '')))
๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ = soup.select('.main') ์ค๋์ข ๊ฐ = [] ์ค๋์๊ฐ์ด์ก = [] for j in range(1, len(soup.select('.main')[0].select('table > tbody > tr'))): ์ค๋์ข ๊ฐ = [] for i in ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ์๊ฐ: ์ค๋์ข ๊ฐ.append(int(i.select('.table > tbody > tr')[j].select('td')[1]. select('td > span')[0].text.replace(',', ''))) ์ค๋์๊ฐ์ด์ก.append(sum(์ค๋์ข ๊ฐ))
์ค๋์๊ฐ์ด์ก
Out[-] [53800, 53180, 53615, 52305, 49035, 48755, 46970, 46140, 45900, 45765, 44000, 43210, 43830, 44310, 44850, 44370, 43935, 44180, 44410, 46245]
# ๋ ์ง table ํฌ๋กค๋ง ๋ ์ง์ ์ฒด = soup.select('.main')[0].select('.table > tbody > tr > td > .date') date = [] for i in ๋ ์ง์ ์ฒด: date.append(i.text) date
Out[-] ['2019.10.23', '2019.10.22', '2019.10.21', '2019.10.18', '2019.10.17', '2019.10.16', '2019.10.15', '2019.10.14', '2019.10.11', '2019.10.10', '2019.10.08', '2019.10.07', '2019.10.04', '2019.10.02', '2019.10.01', '2019.09.30', '2019.09.27', '2019.09.26', '2019.09.25', '2019.09.24']
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date, 오늘시가총액)
plt.xticks(rotation=-45)  # tilt the x-axis tick labels
plt.show()
Out[-]
%matplotlib inline
# re-plot with the dates sorted in ascending order
import matplotlib.pyplot as plt

plt.plot(date[::-1], 오늘시가총액[::-1])
plt.xticks(rotation=-45)
plt.show()
Out[-]
# multiply each day's summed closing price by the assumed 10,000 shares per company
오늘시가총액 = [i * 10000 for i in 오늘시가총액]
오늘시가총액
Out[-] [538000000, 531800000, 536150000, 523050000, 490350000, 487550000, 469700000, 461400000, 459000000, 457650000, 440000000, 432100000, 438300000, 443100000, 448500000, 443700000, 439350000, 441800000, 444100000, 462450000]
# ๋ ์ง table ํฌ๋กค๋ง ๋ ์ง = soup.select('.main')[0].select('.table > tbody > tr > td > .date') date = [] for i in ๋ ์ง: date.append(i.text) date
Out[-] ['2019.10.23', '2019.10.22', '2019.10.21', '2019.10.18', '2019.10.17', '2019.10.16', '2019.10.15', '2019.10.14', '2019.10.11', '2019.10.10', '2019.10.08', '2019.10.07', '2019.10.04', '2019.10.02', '2019.10.01', '2019.09.30', '2019.09.27', '2019.09.26', '2019.09.25', '2019.09.24']
# ์๊ฐํ %matplotlib inline import matplotlib.pyplot as plt plt.plot(date[::-1], ์ค๋์๊ฐ์ด์ก[::-1]) plt.xticks(rotation = -45) plt.show()
Out[-]
๋ฌธ์ 3๋ฒ
๊ฐ ํ์ฌ๋ณ ๊ฑฐ๋ ์ด๋๊ณผ ์ ๊ทธ๋ฃน์ฌ ๊ฑฐ๋ ์ด๋์ subplot์ผ๋ก ๊ทธ๋ ค์ฃผ์ธ์.
๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๋ฐ์ดํฐ = soup.select('.main') ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋ = [[],[],[],[]] ๊ทธ๋ฃน์ฌ์ ์ฒด์ผ์ผ๊ฑฐ๋๋ = [] # ๋ฐ์ดํฐ ๊ตฌ์กฐ : # ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋ = [[์ถํ์ฌ], [์ฐ๊ตฌ์], [๊ณต์ ์ฌ], [ํ์]]
๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๋ฐ์ดํฐ[0].select('.table > tbody > tr')[0] ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๋ฐ์ดํฐ[0].select('.table > tbody > tr')[1].select('td')[-1].text.replace(',','')
Out[-] '398421'
for j in range(1, len(soup.select('.main')[0].select('table > tbody > tr'))):
    for k in range(4):  # same append repeated for each of the four companies
        그룹사별일일거래량[k].append(int(그룹사별일일데이터[k].select('.table > tbody > tr')[j]
                                  .select('td')[-1].text.replace(',', '')))
๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋[0] # ์ถํ์ฌ ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋[1] # ์ฐ๊ตฌ์ ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋[2] # ๊ณต์ ์ฌ ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋[3] # ํ์ len(๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋[0]) ๊ทธ๋ฃน์ฌ๋ณ์ผ์ผ๊ฑฐ๋๋[0]
Out[-] [398421, 919571, 1678055, 2168857, 1982922, 839434, 702104, 764800, 134558, 288563, 223839, 199580, 188467, 160510, 246145, 705046, 408859, 404633, 441923, 211019]
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date[::-1], 그룹사별일일거래량[0][::-1], label='A')
plt.plot(date[::-1], 그룹사별일일거래량[1][::-1], label='B')
plt.plot(date[::-1], 그룹사별일일거래량[2][::-1], label='C')
plt.plot(date[::-1], 그룹사별일일거래량[3][::-1], label='D')
plt.xticks(rotation=-45)
plt.legend(loc=2)
plt.show()
Out[-]
for i in range(len(그룹사별일일거래량[0])):
    s = 0
    for j in range(4):
        s += 그룹사별일일거래량[j][i]
    그룹사전체일일거래량.append(s)
그룹사전체일일거래량
Out[-] [3198301, 2051067, 3724291, 4286651, 3167249, 2477184, 1456343, 1174487, 771938, 1463947, 698527, 673095, 562816, 650582, 784490, 1239662, 872050, 868624, 1115164, 803201]
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date[::-1], 그룹사전체일일거래량[::-1], label='ALL')
plt.xticks(rotation=-45)
plt.legend(loc=2)
plt.show()
Out[-]
f = plt.figure(figsize=(10, 3))

# subplot 1: per-company trading volume
ax = f.add_subplot(121)
ax.plot(date[::-1], 그룹사별일일거래량[0][::-1], label='A')
ax.plot(date[::-1], 그룹사별일일거래량[1][::-1], label='B')
ax.plot(date[::-1], 그룹사별일일거래량[2][::-1], label='C')
ax.plot(date[::-1], 그룹사별일일거래량[3][::-1], label='D')
plt.xticks(rotation=-45)
ax.legend(loc=2)

# subplot 2: group-wide total volume
ax2 = f.add_subplot(122)
ax2.plot(date[::-1], 그룹사전체일일거래량[::-1], label='ALL')
plt.xticks(rotation=-45)
ax2.legend(loc=2)
Out[-]
ย