
Crawling


Crawling is the whole process of using a program to visit web pages, find and collect the information on them, and process it into the form you need.
  • As a rule of thumb, crawling is legal when it does not go against the site operator's wishes and illegal when it does.
  • The robots.txt file in the site's root directory states whether crawling is allowed (paths listed under Disallow must not be crawled).
  • Web-programming elements in a page's source can be recognized as copyrighted works, so copying them without permission is copyright infringement.
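The robots.txt check described above can be automated with the standard library. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are assumptions for illustration, not taken from any real site) and asks whether a path may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "http://example.com/stock.html"))
print(is_allowed(ROBOTS_TXT, "http://example.com/private/report.html"))
```

For a live site, the same parser can load the file itself with set_url() followed by read().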

Required packages

  • (Required) pip3 install beautifulsoup4 or pip3 install bs4
  • (Required) pip3 install requests
  • (Required) pip3 install pandas
  • (Required) pip3 install plotly
  • (Required for the exercises) pip3 install matplotlib
  • (Optional) pip3 install lxml

Library installation (assuming most are already installed)

  • !pip3 install requests
  • !pip3 install beautifulsoup4
# macOS, Linux
!ls
# Windows
!dir
Out[-]
 C 드라이브의 볼륨에는 이름이 없습니다.
 볼륨 일련 번호: CC5E-6766

 C:\Users\leehojun\Google 드라이브\11_1. 콘텐츠 동영상 결과물\007. 크롤링 강의 디렉터리

2020-04-03 13:16 <DIR> .
2020-04-03 13:16 <DIR> ..
2020-04-03 13:10 <DIR> .ipynb_checkpoints
2020-04-03 12:18 391,939 001.ipynb
2020-04-03 01:58 <DIR> 참고자료
2020-04-03 13:16 999 최종강의자료_크롤링.ipynb
2개 파일 392,938 바이트
4개 디렉터리 13,425,782,784 바이트 남음

URL (Uniform Resource Locator)

  • A convention for stating where a resource is located.
  • Commonly thought of as a website address, but a URL can refer to any resource on a computer network, not only websites.
  • To reach the address you need to know the protocol the URL specifies and connect with that same protocol: an FTP client for FTP, a web browser for HTTP, a Telnet program for Telnet.
Source: Wiki
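To see the parts of a URL described above, the standard library can split one into its components. This is a small sketch; the query string and fragment are made up for illustration.

```python
from urllib.parse import urlsplit

# The query string and fragment are hypothetical additions to the lecture URL.
parts = urlsplit("http://www.paullab.co.kr/stock.html?page=2#update")

print(parts.scheme)    # protocol used to reach the resource
print(parts.netloc)    # host that serves it
print(parts.path)      # location of the resource on that host
print(parts.query)     # parameters sent with the request
print(parts.fragment)  # position inside the document
```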

HTTP (Hypertext Transfer Protocol)

  • Transfers HTML, XML, JavaScript, audio, video, images, PDF, etc.
  • A message consists of a request line or status line / headers (optional) / a blank line (end of the headers) / a body (optional).

HTTP request ex 1

GET /stock.html HTTP/1.1
Host: www.paullab.co.kr

HTTP request ex 2

GET /index.html HTTP/1.1
user-agent: MSIE 6.0; Windows NT 5.0
accept: text/html, */*
cookie: name = value
referer: http://www.naver.com
host: www.paullab.co.kr
  1. Request line: the method (how the data is handled), the requested page, and the protocol version.
  2. User-Agent: the type and version of the user's web browser.
  3. Accept: the data types, among those the web server can send, that the web browser can handle.
    1. Here text/html means the browser can handle text and HTML documents, and */* means it can handle any document. (These are also called MIME types.)
  4. Cookie: because HTTP itself is stateless (it does not keep a connection state between requests), a cookie is an artificial value created to remember user information such as login authentication. In other words, it is used to decide whether the user holds valid login credentials.
  5. Referer: the domain or URL of the site the user came through before reaching the current page.
  6. Host: the domain the user requested.
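The request layout above (request line, headers, blank line, optional body) can be assembled by hand. This is a minimal sketch for illustration only; build_request is a hypothetical helper, not part of requests or any other library.

```python
def build_request(method: str, path: str, headers: dict, body: str = "") -> str:
    """Assemble a raw HTTP/1.1 request message as text."""
    lines = [f"{method} {path} HTTP/1.1"]  # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]  # headers
    # A blank line marks the end of the headers; the body (if any) follows.
    return "\r\n".join(lines) + "\r\n\r\n" + body


message = build_request(
    "GET",
    "/stock.html",
    {"Host": "www.paullab.co.kr", "Accept": "text/html"},
)
print(message)
```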

HTTP Response ex 1

HTTP/1.1 200 OK ## status line
Content-Type: application/xhtml+xml; charset=utf-8 ## header
 ## blank line
<html> ## body
...
</html>

HTTP Response ex 2

HTTP/1.1 200 OK
Server: NCSA/1.4.2
Content-type: text/html
Content-length: 107

<html>
...
</html>
  1. The web protocol version and the response code.
  2. The type and version of the web application (server software).
  3. The MIME type.
  4. The size of the body being received.
  5. The web page the user requested.
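Going the other way, a raw response can be split back into the parts listed above. This sketch uses a made-up response string and a hypothetical parse_response helper, and it ignores details a real client must handle (repeated headers, chunked bodies).

```python
# Hypothetical raw response, for illustration only.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)


def parse_response(raw: str):
    """Split a raw HTTP response into (version, code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")  # the blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, code, reason)
print(headers)
print(body)
```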

HTTP methods

  • GET : retrieve a resource (parameters are appended after ? in the URL; suited to small values)
  • POST : create a resource (data is carried in the body; suited to relatively large payloads)
  • PUT : request to modify a resource
  • DELETE : request to delete a resource
  • HEAD : request only the HTTP headers, typically to check whether a resource exists
  • OPTIONS : ask which methods the web server supports
  • TRACE : trace the path the request takes to the resource
  • CONNECT : start a two-way connection (tunnel) to the requested resource
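As a quick illustration of how the method is chosen, the standard library's Request object can be built without sending anything. The sketch below only constructs the requests; the body bytes are made up.

```python
from urllib.request import Request

URL = "http://www.paullab.co.kr/stock.html"

get_request = Request(URL)                     # no body, so the method is GET
post_request = Request(URL, data=b"pa1=val1")  # a body makes the default POST
head_request = Request(URL, method="HEAD")     # the method can be set explicitly

for request in (get_request, post_request, head_request):
    print(request.get_method(), request.full_url)
```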

Status codes

  • 200 : the server processed the request successfully.
  • 201 : the request succeeded and the server created a new resource.
  • 202 : the server accepted the request but has not processed it yet.
  • 301 : the requested page has moved permanently to a new location.
  • 403 : the server refused the request.
  • 404 : the server could not find the requested page.
  • 500 : an error occurred on the server and it could not carry out the request.
  • 503 : the server is currently unavailable because it is overloaded or down for maintenance.
Source: Wiki
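The codes above also ship with Python's standard library, which is handy for turning a number into its standard phrase. A small sketch; the category labels and the describe helper are illustrative choices, not from the lecture.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return 'code phrase (category)' for a known HTTP status code."""
    categories = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    status = HTTPStatus(code)
    category = categories.get(code // 100, "other")
    return f"{code} {status.phrase} ({category})"


for code in (200, 201, 202, 301, 403, 404, 500, 503):
    print(describe(code))
```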
import requests
import bs4
requests.__version__ # check the requests version
Out[-] '2.22.0'
bs4.__version__ # check the bs4 version
Out[-] '4.7.1'
from datetime import datetime
datetime.now() # show the current time
Out[-] datetime.datetime(2020, 4, 3, 13, 36, 59, 789815)

requests

  • A library for sending HTTP requests
  • .text : returns the response body as a str
  • .headers : the response headers (stored as key/value pairs)
  • .encoding : the encoding used to decode the body
  • .status_code : the HTTP status code, which tells whether the request succeeded, failed, or is in some other state
  • .ok : True when the data was fetched without an error status (status code below 400)
import requests
html = requests.get('http://www.paullab.co.kr/stock.html')
html
Out[-] <Response [200]> # the server processed the request successfully
html.text # the Korean text comes out garbled
Out[-] '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\n <meta http-equiv="X-UA-Compatible" content="ie=edge">\n <title>Document</title>\n <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css">\n <style>\n h1{\n margin: 2rem;\n }\n h1>span{\n font-size: 1rem;\n }\n .main {\n width: 70%;\n margin: 3rem auto auto auto;\n text-align: center\n }\n\n table {\n width: 100%;\n }\n </style>\n</head>\n\n<body>\n <h1>í\x81¬ë¡¤ë§\x81 ì\x97°ì\x8aµì\x9a© í\x8e\x98ì\x9d´ì§\x80 ... <td class="num"><span>139,085</span></td>\n </tr>\n </tbody>\n </table>\n </div>\n</body>\n\n</html>\n'
html.headers
Out[-] {'Server': 'nginx', 'Date': 'Sat, 04 Apr 2020 11:13:06 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'P3P': "CP='NOI CURa ADMa DEVa TAIa OUR DELa BUS IND PHY ONL UNI COM NAV INT DEM PRE'", 'X-Powered-By': 'PHP/5.5.17p1', 'Content-Encoding': 'gzip'}
dir(html)
Out[-] ['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', ... 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
# ISO-8859-1 is an ASCII-based extended encoding
html.encoding
Out[-] 'ISO-8859-1'
html.encoding = 'utf-8' # decode as UTF-8 so the Korean text displays correctly
html.text
Out[-] '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\n <meta http-equiv="X-UA-Compatible" content="ie=edge">\n <title>Document</title>\n <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css">\n <style>\n h1{\n margin: 2rem;\n }\n h1>span{\n font-size: 1rem;\n }\n .main {\n width: 70%;\n margin: 3rem auto auto auto;\n text-align: center\n }\n\n table {\n width: 100%;\n }\n </style>\n</head>\n\n<body>\n <h1>크롤링 연습용 페이지 ... <td class="num"><span>139,085</span></td>\n </tr>\n </tbody>\n </table>\n </div>\n</body>\n\n</html>\n'
html.status_code
Out[-] 200 # 200 : the request succeeded
html.ok
Out[-] True
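The garbling seen earlier happens because the Content-Type header carries no charset, so requests falls back to ISO-8859-1 for text responses while the page is actually UTF-8. The sketch below reproduces the effect offline with a plain string in place of a real response.

```python
original = "크롤링 연습용 페이지"        # text as the server intends it
raw_bytes = original.encode("utf-8")    # bytes that travel over the network

garbled = raw_bytes.decode("iso-8859-1")  # wrong guess: one character per byte
repaired = garbled.encode("iso-8859-1").decode("utf-8")  # undo the wrong guess

print(repr(garbled))
print(repaired)
```

Setting response.encoding = 'utf-8' before reading .text, as done above, avoids the wrong decoding in the first place.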

Passing parameters with GET

<html>
<head>
</head>
<body>
  <form action="test.html" method="GET">
    <input type="text" name="user_id">
    <input type="password" name="user_pw">
    <input type="submit" name="submit">
  </form>
</body>
</html>
params = {'pa1': 'val1', 'pa2': 'value2'}
response = requests.get('http://www.paullab.co.kr', params=params)
response.url
Out[-] 'http://www.paullab.co.kr/?pa1=val1&pa2=value2'
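The query string in that URL can be reproduced with the standard library, which shows what the params argument does. A sketch; the second call with a Korean value is an added illustration.

```python
from urllib.parse import urlencode

params = {"pa1": "val1", "pa2": "value2"}
query = urlencode(params)
url = "http://www.paullab.co.kr/?" + query
print(url)

# Non-ASCII values are percent-encoded as UTF-8 bytes.
korean_query = urlencode({"q": "제주"})
print(korean_query)
```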

Passing data in a POST request

import requests, json
data = {'pa1': 'val1', 'pa2': 'value2'}
# json.dumps() turns the dict into a JSON string, which is sent as the request body
response = requests.post('http://www.paullab.co.kr', data=json.dumps(data))
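The body format depends on what is passed as data. As a rough offline comparison (no request is sent), a dict passed directly is form-encoded, while the json.dumps() call above produces a JSON string:

```python
import json
from urllib.parse import urlencode

data = {"pa1": "val1", "pa2": "value2"}

form_body = urlencode(data)   # the form-encoded shape a plain dict would take
json_body = json.dumps(data)  # the JSON string sent by the snippet above

print(form_body)
print(json_body)
```

requests also accepts json=data, which serializes the dict and sets a JSON Content-Type header.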

Adding headers and cookies

headers = {'Content-Type': 'application/json; charset=utf-8'}
cookies = {'session_id': 'sorryidontcare'}
response = requests.get('http://www.paullab.co.kr', headers=headers, cookies=cookies)

Adding authentication

response = requests.post('http://www.paullab.co.kr', auth=("id","pass"))
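A (user, password) tuple passed as auth is sent with HTTP Basic authentication, that is, as an Authorization header holding the base64 of "user:password". The sketch below builds that header value by hand, assuming ASCII credentials; basic_auth_header is a hypothetical helper name.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("id", "pass")
print(header)
```

Base64 is an encoding, not encryption, so Basic authentication is only safe over HTTPS.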

Crawling with requests

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser') # parse the HTML string into a searchable tree
print(soup.prettify()) # print as indented HTML
Out[-]
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="ie=edge" http-equiv="X-UA-Compatible"/>
  <title>
   Document
  </title>
...
      <td class="num">
       <span>
        139,085
       </span>
      </td>
     </tr>
    </tbody>
   </table>
  </div>
 </body>
</html>

Saving a page's source code to a file

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
# save the page source to a file (the with block closes the file automatically)
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(html)
!dir
Out[-]
 C 드라이브의 볼륨에는 이름이 없습니다.
 볼륨 일련 번호: CC5E-6766

 C:\Users\leehojun\Google 드라이브\11_1. 콘텐츠 동영상 결과물\007. 크롤링 강의 디렉터리

2020-04-03 13:55 <DIR> .
2020-04-03 13:55 <DIR> ..
2020-04-03 13:10 <DIR> .ipynb_checkpoints
2020-04-03 13:50 392,338 001.ipynb
2020-04-03 13:55 48,527 test.html # test.html has been created
2020-04-03 01:58 <DIR> 참고자료
2020-04-03 13:54 221,527 최종강의자료_크롤링.ipynb
3개 파일 662,392 바이트
4개 디렉터리 13,301,800,960 바이트 남음
# find a specific word in the page source
s = html.split(' ') # split on spaces
# a word is only counted when it has a space on both sides
word = input('Enter a word to search for in the page : ')
s.count(word)
Out[-] Enter a word to search for in the page : 제주
0
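The count of 0 above comes from matching whole space-separated tokens. As a sketch of the difference, the example below uses a one-line stand-in for the page source (not the real page) and compares token matching with substring matching.

```python
# Tiny stand-in for the downloaded page source (not the real page).
html = '<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>'
word = "제주"

tokens = html.split(" ")
token_hits = tokens.count(word)    # exact-token matches only
substring_hits = html.count(word)  # matches anywhere in the text

print(token_hits)
print(substring_hits)
```

Substring counting also matches inside attributes such as id, so parsing the page first gives more precise results.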

BeautifulSoup

  • A library that turns HTML held in a str into data with an HTML structure
  • BeautifulSoup(markup, "html.parser") : the parser built into Python
  • BeautifulSoup(markup, "lxml") : lxml's HTML parser (requires lxml)
  • BeautifulSoup(markup, "lxml-xml") BeautifulSoup(markup, "xml") : lxml's XML parser (requires lxml)
  • BeautifulSoup(markup, "html5lib") : parses the way a web browser does (requires html5lib)
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
soup.title # the title tag
Out[-] <title>Document</title>
soup.title.string # only the string inside the title tag
Out[-] 'Document'
soup.title.text # gives the same result as .string here
Out[-] 'Document'
soup.title.parent.name # name of the parent tag
Out[-] 'head'
soup.tr # table row
Out[-] <tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
soup.td # table data
Out[-] <td align="center "><span class="date">2019.10.23</span></td>
soup.th # table header cell
Out[-] <th scope="col">날짜</th>
soup.table
Out[-] <table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
<tr>
<td align="center "><span class="date">2019.10.23</span></td>
<td class="num"><span>6,650</span></td>
...
</td>
<td class="num"><span>5,300</span></td>
<td class="num"><span>5,370</span></td>
<td class="num"><span>5,280</span></td>
<td class="num"><span>211,019</span></td>
</tr>
</tbody>
</table>
soup.find('title') # find() : returns the first tag that matches the condition
Out[-] <title>Document</title>
soup.find('tr')
Out[-] <tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
soup.find('th')
Out[-] <th scope="col">날짜</th>
soup.find(id=('update')).text # text of the element with a given id
Out[-] 'update : 20.12.30'
soup.find('head').find('title') # the title inside head
Out[-] <title>Document</title>
soup.find('h2', id='제주코딩베이스캠프연구원') # the h2 whose id is '제주코딩베이스캠프연구원'
Out[-] <h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>
soup.find_all('h2') # find_all() : returns every tag that matches the condition
Out[-] [<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>,
<h2 id="제주코딩베이스캠프공업">제주코딩베이스캠프 공업</h2>,
<h2 id="제주코딩베이스캠프출판사">제주코딩베이스캠프 출판사</h2>,
<h2 id="제주코딩베이스캠프학원">제주코딩베이스캠프 학원</h2>]
soup.find_all('h2')[0]
Out[-] <h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>
soup.find_all('table', class_='table')
# class_ : class is a reserved word in Python, so BeautifulSoup takes the keyword argument class_ instead
# reserved word : a word set aside in advance by the language for a specific function
Out[-] [<table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
<tr>
<td align="center "><span class="date">2019.10.23</span></td>
<td class="num"><span>6,650</span></td>
...
</td>
<td class="num"><span>2,020</span></td>
<td class="num"><span>2,090</span></td>
<td class="num"><span>2,020</span></td>
<td class="num"><span>139,085</span></td>
</tr>
</tbody>
</table>]
soup = BeautifulSoup('''
<hojun id='jeju' class='codingBaseCamp codingLevelUp'>
 hello world
</hojun>
''', 'html.parser') # name the parser explicitly to avoid a warning
# tag = hojun , id = 'jeju' , class = 'codingBaseCamp codingLevelUp'
tag = soup.hojun
tag
Out[-] <hojun class="codingBaseCamp codingLevelUp" id="jeju"> hello world </hojun>
type(tag)
Out[-] bs4.element.Tag
dir(tag) # attributes and methods of the tag
Out[-] ['HTML_FORMATTERS', 'XML_FORMATTERS', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', ... 'setup', 'string', 'strings', 'stripped_strings', 'text', 'unwrap', 'wrap']
tag.name
Out[-] 'hojun'
tag['class']
Out[-] ['codingBaseCamp', 'codingLevelUp']
tag['id']
Out[-] 'jeju'
tag.attrs # use this to see all attributes at once
Out[-] {'id': 'jeju', 'class': ['codingBaseCamp', 'codingLevelUp']}
tag.string # the string inside the tag
Out[-] '\n hello world\n'
tag.text # the text inside the tag
Out[-] '\n hello world\n'
tag.contents # the children as a list
Out[-] ['\n hello world\n']
for i in tag.children: # children : the tag's direct children
    print(i)
Out[-] hello world
tag.children
Out[-] <list_iterator at 0x1cd52f37550>
soup = BeautifulSoup('''
<ul>
 <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
 <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
 <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
</ul>
''', 'html.parser') # name the parser explicitly to avoid a warning
tag = soup.ul
tag
Out[-] <ul>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
</ul>
tag.contents # list
Out[-] ['\n', <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>, '\n', <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>, '\n', <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>, '\n']
tag.li # the first li tag
Out[-] <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
tag.li.parent # the parent tag of the li tag
Out[-] <ul>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
</ul>

Selector

  • Allows more fine-grained access to tags
  • Use '.' to refer to a class and '#' to refer to an id
  • Use '>' when the tag you are looking for is directly under a particular tag
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
soup.select('#update')
Out[-] [<span id="update">update : 20.12.30</span>]
soup.select('.table > tr') # tr tags directly inside the element with class 'table'
# actual nesting : table > tbody > tr ('>' matches direct children only, so nothing is found)
Out[-] []
soup.select('.table > tbody > tr')[2] # the third tr inside tbody inside the element with class 'table'
Out[-] <tr>
<td align="center"><span class="date">2019.10.22</span></td>
<td class="num"><span>6,630</span></td>
<td class="num">
<img alt="하락" height="6" src="ico_down.gif" style="margin-right:4px;" width="7"/>
<span class="tah p11 nv01">
 190
 </span>
</td>
<td class="num"><span>6,830</span></td>
<td class="num"><span>6,930</span></td>
<td class="num"><span>6,530</span></td>
<td class="num"><span>919,571</span></td>
</tr>
# ways to select elements
soup.select("p > a:nth-of-type(2)") # the second a tag among the a children of p
soup.select("p > a:nth-child(even)") # a tags that are even-numbered children of p (use odd for odd-numbered ones)
soup.select('a[href]') # a tags that have an href attribute
soup.select("#link1 + .sister") # the element with class sister that immediately follows the element with id link1
oneStep = soup.select('.main')[0] # 제주코딩베이스캠프 연구원
oneStep
Out[-] <div class="main">
<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>
<h3><span style="color: salmon">일별</span> 시세</h3>
<table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
...
<td class="num"><span>5,300</span></td>
<td class="num"><span>5,370</span></td>
<td class="num"><span>5,280</span></td>
<td class="num"><span>211,019</span></td>
</tr>
</tbody>
</table>
</div>
twoStep = oneStep.select('tbody > tr')[1:] # skip the header row
twoStep
Out[-] [<tr>
<td align="center "><span class="date">2019.10.23</span></td>
<td class="num"><span>6,650</span></td>
<td class="num">
<img alt="상승 " height="6 " src="ico_up.gif " style="margin-right:4px; " width="7 "/>
<span>
 20
 </span>
...
 10
 </span>
</td>
<td class="num"><span>5,300</span></td>
<td class="num"><span>5,370</span></td>
<td class="num"><span>5,280</span></td>
<td class="num"><span>211,019</span></td>
</tr>]
twoStep[0].select('td')[0].text # date
Out[-] '2019.10.23'
twoStep[0].select('td')[1].text # closing price
Out[-] '6,650'
# The value is a string, so convert it to a number before using it in calculations.
# twoStep[0].select('td')[1].text.replace(',', '')
날짜 = [] # dates
종가 = [] # closing prices
for i in twoStep:
    날짜.append(i.select('td')[0].text)
    종가.append(int(i.select('td')[1].text.replace(',', '')))
날짜
Out[-] ['2019.10.23', '2019.10.22', '2019.10.21', '2019.10.18', '2019.10.17', '2019.10.16', '2019.10.15', '2019.10.14', '2019.10.11', '2019.10.10', '2019.10.08', '2019.10.07', '2019.10.04', '2019.10.02', '2019.10.01', '2019.09.30', '2019.09.27', '2019.09.26', '2019.09.25', '2019.09.24']
종가
Out[-] [6650, 6630, 6820, 6430, 5950, 5930, 5640, 5380, 5040, 5100, 5050, 4940, 5010, 4920, 5010, 5000, 5010, 5060, 5060, 5330]
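The scraped values are strings and arrive newest first. One way to prepare them for plotting, sketched below on the first three rows (prices written in the comma-separated form used on the page), is to parse the dates and sort chronologically; the variable names are illustrative.

```python
from datetime import datetime

# First three rows from the lists above, as they would appear on the page.
dates = ["2019.10.23", "2019.10.22", "2019.10.21"]
closes = ["6,650", "6,630", "6,820"]

rows = [
    (datetime.strptime(d, "%Y.%m.%d").date(), int(c.replace(",", "")))
    for d, c in zip(dates, closes)
]
rows.sort()  # oldest first, ordered by the parsed date

for day, price in rows:
    print(day.isoformat(), price)
```

Sorting on real date objects does not depend on the order in which the site lists the rows.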
# visualization
# price trend by date
import plotly.express as px
fig = px.line(x=날짜, y=종가, title='jejucodingcamp')
fig.show()
Out[-] [Figure: line chart titled 'jejucodingcamp'; x-axis 날짜 (date), y-axis 종가 (closing price)]

Exercises

Library installation (assuming most are already installed)

  • !pip3 install requests
  • !pip3 install beautifulsoup4

Problem 1

Assuming each company has 10,000 shares outstanding, find the total market capitalization of all the group companies.
  • Group companies : [ 제주코딩베이스캠프 연구원 (research institute), 제주코딩베이스캠프 공업 (industries), 제주코딩베이스캠프 출판사 (publisher), 제주코딩베이스캠프 학원 (academy) ]
import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.paullab.co.kr/stock.html")
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
soup.select('.main')[0] # 제주코딩베이스캠프 연구원
soup.select('.main')[1] # 제주코딩베이스캠프 공업
soup.select('.main')[2] # 제주코딩베이스캠프 출판사
soup.select('.main')[3] # 제주코딩베이스캠프 학원
Out[-] <div class="main">
<h2 id="제주코딩베이스캠프학원">제주코딩베이스캠프 학원</h2>
<h3><span style="color: salmon">일별</span> 시세</h3>
<table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
...
<td class="num"><span>2,020</span></td>
<td class="num"><span>2,090</span></td>
<td class="num"><span>2,020</span></td>
<td class="num"><span>139,085</span></td>
</tr>
</tbody>
</table>
</div>
그룹사별일일시가 = soup.select('.main') # one .main block per group company
오늘종가 = [] # latest closing prices
오늘시가총액 = [] # latest market capitalizations
for i in 그룹사별일일시가:
    print(i.select('.table > tbody > tr')[1].select('td')[1])
    print(i.select('.table > tbody > tr')[1].select('td')[1].text)
    print(i.select('.table > tbody > tr')[1].select('td')[1].text.replace(',', ''))
Out[-]
<td class="num"><span>6,650</span></td> # Oct 23, 제주코딩베이스캠프 연구원, closing price
6,650
6650
<td class="num"><span>31,300</span></td> # Oct 23, 제주코딩베이스캠프 공업, closing price
31,300
31300
<td class="num"><span>13,250</span></td> # Oct 23, 제주코딩베이스캠프 출판사, closing price
13,250
13250
<td class="num"><span>2,600</span></td> # Oct 23, 제주코딩베이스캠프 학원, closing price
2,600
2600
그룹사별일일시가 = soup.select('.main')
오늘종가 = []
오늘시가총액 = []
for i in 그룹사별일일시가:
    오늘종가.append(int(i.select('.table > tbody > tr')[1].select('td')[1].select('td > span')[0].text.replace(',', '')))
print(오늘종가)
Out[-] [6650, 31300, 13250, 2600]
오늘시가총액 = [i*10000 for i in 오늘종가] # closing price x 10,000 shares
전그룹사시가총액 = format(sum(오늘시가총액), ',') # total, with thousands separators
전그룹사시가총액
Out[-] '538,000,000'

Problem 2

Plot the trend of the total market capitalization of all the group companies. The x-axis is the date and the y-axis is the price.
# collect each group company's closing price for the latest day
그룹사별일일시가 = soup.select('.main')
오늘종가 = []
오늘시가총액 = []
for i in 그룹사별일일시가:
    오늘종가.append(int(i.select('.table > tbody > tr')[1].select('td')[1].select('td > span')[0].text.replace(',', '')))
# repeat for every row (day) and sum the four closing prices per day
그룹사별일일시가 = soup.select('.main')
오늘종가 = []
오늘시가총액 = []
for j in range(1, len(soup.select('.main')[0].select('table > tbody > tr'))):
    오늘종가 = []
    for i in 그룹사별일일시가:
        오늘종가.append(int(i.select('.table > tbody > tr')[j].select('td')[1].select('td > span')[0].text.replace(',', '')))
    오늘시가총액.append(sum(오늘종가))
오늘시가총액
Out[-] [53800, 53180, 53615, 52305, 49035, 48755, 46970, 46140, 45900, 45765, 44000, 43210, 43830, 44310, 44850, 44370, 43935, 44180, 44410, 46245]
# crawl the date column of the table
날짜전체 = soup.select('.main')[0].select('.table > tbody > tr > td > .date')
date = []
for i in 날짜전체:
    date.append(i.text)
date
Out[-] ['2019.10.23', '2019.10.22', '2019.10.21', '2019.10.18', '2019.10.17', '2019.10.16', '2019.10.15', '2019.10.14', '2019.10.11', '2019.10.10', '2019.10.08', '2019.10.07', '2019.10.04', '2019.10.02', '2019.10.01', '2019.09.30', '2019.09.27', '2019.09.26', '2019.09.25', '2019.09.24']
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(date, 오늘시가총액)
plt.xticks(rotation = -45 ) # rotate the x-axis tick labels
plt.show()
Out[-] [Figure: line chart of 오늘시가총액 by date, newest date first]
%matplotlib inline
# plot again with the dates in chronological order
import matplotlib.pyplot as plt
plt.plot(date[::-1], 오늘시가총액[::-1])
plt.xticks(rotation = -45 )
plt.show()
Out[-] [Figure: line chart of 오늘시가총액 by date, oldest date first]
import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.paullab.co.kr/stock.html")
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# multiply the daily closing-price sums computed above by 10,000 shares to get the market capitalization
오늘시가총액 = [i * 10000 for i in 오늘시가총액]
오늘시가총액
Out[-] [538000000, 531800000, 536150000, 523050000, 490350000, 487550000, 469700000, 461400000, 459000000, 457650000, 440000000, 432100000, 438300000, 443100000, 448500000, 443700000, 439350000, 441800000, 444100000, 462450000]
# crawl the date column of the table
날짜 = soup.select('.main')[0].select('.table > tbody > tr > td > .date')
date = []
for i in 날짜:
    date.append(i.text)
date
Out[-] ['2019.10.23', '2019.10.22', '2019.10.21', '2019.10.18', '2019.10.17', '2019.10.16', '2019.10.15', '2019.10.14', '2019.10.11', '2019.10.10', '2019.10.08', '2019.10.07', '2019.10.04', '2019.10.02', '2019.10.01', '2019.09.30', '2019.09.27', '2019.09.26', '2019.09.25', '2019.09.24']
# visualization
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(date[::-1], 오늘시가총액[::-1])
plt.xticks(rotation = -45)
plt.show()
Out[-] [Figure: line chart of total market capitalization by date, oldest date first]

Problem 3

Draw each company's trading volume and the total trading volume of all the group companies as subplots.
그룹사별일일데이터 = soup.select('.main') # one .main block per group company
그룹사별일일거래량 = [[],[],[],[]] # daily volume per company
그룹사전체일일거래량 = [] # daily volume of all companies combined
# data structure (same order as the .main blocks on the page) :
# 그룹사별일일거래량 = [[연구원], [공업], [출판사], [학원]]
그룹사별일일데이터[0].select('.table > tbody > tr')[0]
그룹사별일일데이터[0].select('.table > tbody > tr')[1].select('td')[-1].text.replace(',','')
Out[-] '398421'
for j in range(1, len(soup.select('.main')[0].select('table > tbody > tr'))):
    for k in range(4): # one pass per group company
        그룹사별일일거래량[k].append(int(그룹사별일일데이터[k].select('.table > tbody > tr')[j].select('td')[-1].text.replace(',', '')))
그룹사별일일거래량[0] # 연구원
그룹사별일일거래량[1] # 공업
그룹사별일일거래량[2] # 출판사
그룹사별일일거래량[3] # 학원
len(그룹사별일일거래량[0])
그룹사별일일거래량[0]
Out[-] [398421, 919571, 1678055, 2168857, 1982922, 839434, 702104, 764800, 134558, 288563, 223839, 199580, 188467, 160510, 246145, 705046, 408859, 404633, 441923, 211019]
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(date[::-1], 그룹사별일일거래량[0][::-1], label='A')
plt.plot(date[::-1], 그룹사별일일거래량[1][::-1], label='B')
plt.plot(date[::-1], 그룹사별일일거래량[2][::-1], label='C')
plt.plot(date[::-1], 그룹사별일일거래량[3][::-1], label='D')
plt.xticks(rotation = -45 )
plt.legend(loc=2)
plt.show()
Out[-] [Figure: line chart of daily trading volume for companies A, B, C and D by date]
for i in range(len(그룹사별일일거래량[0])):
    s = 0
    for j in range(4):
        s += 그룹사별일일거래량[j][i]
    그룹사전체일일거래량.append(s)
그룹사전체일일거래량
Out[-] [3198301, 2051067, 3724291, 4286651, 3167249, 2477184, 1456343, 1174487, 771938, 1463947, 698527, 673095, 562816, 650582, 784490, 1239662, 872050, 868624, 1115164, 803201]
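The per-day totals can also be computed without index arithmetic. The sketch below uses small made-up volumes (not the real data) and zip to regroup the per-company lists by day before summing.

```python
# Hypothetical volumes for three companies over four days (not the real data).
volumes_by_company = [
    [100, 200, 300, 400],
    [10, 20, 30, 40],
    [1, 2, 3, 4],
]

# zip(*...) regroups the values by day, so each tuple holds one day's volumes.
total_by_day = [sum(day) for day in zip(*volumes_by_company)]
print(total_by_day)
```

This form works for any number of companies, whereas the nested loop above is fixed at four.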
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(date[::-1], 그룹사전체일일거래량[::-1], label='ALL')
plt.xticks(rotation = -45 )
plt.legend(loc=2)
plt.show()
Out[-] [Figure: line chart of the combined daily trading volume (ALL) by date]
f = plt.figure(figsize=(10,3))
# plot 1 (per group company)
ax = f.add_subplot(121)
ax.plot(date[::-1], 그룹사별일일거래량[0][::-1], label='A')
ax.plot(date[::-1], 그룹사별일일거래량[1][::-1], label='B')
ax.plot(date[::-1], 그룹사별일일거래량[2][::-1], label='C')
ax.plot(date[::-1], 그룹사별일일거래량[3][::-1], label='D')
plt.xticks(rotation = -45)
ax.legend(loc=2)
# plot 2 (all companies combined)
ax2 = f.add_subplot(122)
ax2.plot(date[::-1], 그룹사전체일일거래량[::-1], label='ALL')
plt.xticks(rotation = -45)
ax2.legend(loc=2)
plt.show()
Out[-] [Figure: two side-by-side subplots; left: daily trading volume for A, B, C and D; right: combined daily trading volume (ALL)]