Scraping in Ruby with Nokogiri: Beginner's Edition

This time I'm writing about gathering information with Ruby and Python. (It's my first technical write-up, so I'm nervous and excited.)

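Background

For example, at 500, where I currently work, requests like "Collect info on 20,000 universities across the US by tomorrow!" (number_of_students, student_to_faculty_ratio, public_or_private, location, etc.) drop on me out of nowhere while I'm peacefully picking at my lunch in a corner of the office. Hunting down sites and copy-pasting them one by one would obviously never finish. Besides, a wimp like me doesn't have that kind of grit. At times like that, scraping is my trusty ally.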

What is scraping?

Scraping means collecting the HTML of web pages from a website, then extracting specific data and reshaping it. With web scraping you can fetch and collect data from web pages efficiently, almost as if you were using a Web API!! (wow)

Webスクレイピングとは (Web scraping) ウェブスクレイピング： - IT用語辞典バイナリ

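What is Nokogiri?

Nokogiri is a lovely library that parses the structure of HTML and XML and turns it into a form that makes specific elements easy to select. You can extract elements with XPath or CSS selectors.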

Nokogiri の基本(翻訳版) - Engine Yard Blog

Installation

# Install Nokogiri.

gem install nokogiri

# To use it inside a Rails app, add this line to the Gemfile.

gem 'nokogiri'

# Then run bundle install.

bundle install

Usage

Now let's actually write a Ruby script that fetches the company name, logo, and URL of 100 venture companies from BestVenture100.
First, create a file named best_venture_100_list.rb. Open it in an editor and write the following source.

# Load the library for accessing URLs
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load the library for converting strings to Shift_JIS
require 'kconv'
# Load the library for writing CSV
require 'csv'

count = 0

best_venture_100_list = []

# URL to scrape
url = "http://best100.v-tsushin.jp/"

charset = nil
# Use URI.open: on Ruby 3.0 and later, Kernel#open no longer opens URLs
html = URI.open(url) do |f|
  charset = f.charset # get the character encoding reported by the server
  f.read # read the HTML and return it into the html variable
end

# Parse the HTML (converted to UTF-8) and build a document object
doc = Nokogiri::HTML.parse(html.toutf8, nil, 'utf-8')
# Get the ul with id=companyList
companies = doc.css("ul#companyList")
# Get the li elements under that ul as a list
company_list = companies.css("li")
# Walk through the li elements one by one with each
company_list.each do |company|
  # Create an Array to hold this company's data
  data = []
  count += 1
  # Get the title attribute of the img inside the li
  name = company.css("img")[0][:title]
  # Get the href attribute of the a inside the li
  link_url = company.css("a")[0][:href]
  # Get the src attribute of the img inside the li
  logo = company.css("img")[0][:src]

  # Check each value
  p count, name, link_url, logo

  # Store the values in the data array (converted to Shift_JIS)
  data.push(count)
  data.push(name.tosjis)
  data.push(link_url.tosjis)
  data.push(logo.tosjis)

  # Push data into best_venture_100_list to build a nested array (for the CSV export)
  best_venture_100_list.push(data)
end

# Export to CSV
CSV.open("best_venture_100_list.csv", "wb") do |csv|
  best_venture_100_list.each do |r|
    csv << r
  end
end

If the output looks like the following, it worked. A CSV file is also generated automatically in the same directory as the source code.

1
"ヴォラーレ株式会社"
"http://best100.v-tsushin.jp/2010/05/volare.php"
"img/logo/volare_logo.jpg" 
...
(omitted)
...
100
"株式会社ハロネット"
"http://best100.v-tsushin.jp/2013/12/post_138.php"
"img/logo/hallonet-logo-top.jpg"

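The logo values printed above are relative paths such as img/logo/volare_logo.jpg. If absolute URLs are wanted in the CSV, one option is to resolve them against the page URL with URI.join from the standard library. This is only a sketch and not part of the script above; the helper name absolute_url and the constant BASE_URL are made up here.

```ruby
require 'uri'

# The page that was scraped; relative paths are resolved against it
BASE_URL = "http://best100.v-tsushin.jp/"

# Hypothetical helper: turn a possibly relative path into an absolute URL
def absolute_url(base, path)
  URI.join(base, path).to_s
end

puts absolute_url(BASE_URL, "img/logo/volare_logo.jpg")
# A value that is already absolute comes back unchanged
puts absolute_url(BASE_URL, "http://best100.v-tsushin.jp/2010/05/volare.php")
```
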
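Detailed walkthrough

And now for the main topic. When I searched for information on scraping, I couldn't find many good articles, so I'll write down concrete usage here. Suppose we have a simple site like the following.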

<html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
      <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
      <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
      <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
      <p>Here are some entertaining links:</p>
      <ul>
         <li><a href="http://youtube.com">YouTube</a></li>
         <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
         <li><a href="http://kathack.com/">Kathack</a></li>
         <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
      </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
</html>

Let's knead away at scraping the site above. Below I list how to get each element. (I couldn't use markup inside a Hatena table. Sorry it's hard to read.)

Getting the title

# The <title> element
page.css('title')

<html>
<head><title>My webpage</title></head>
<body>
<h1>Hello Webpage!</h1>
<div id="references">

↑↑↑ The title element is returned

Getting the li elements

# All <li> elements
page.css('li')

<div id="funstuff">
   <p>Here are some entertaining links:</p>
   <ul>
      <li><a href="http://youtube.com">YouTube</a></li>
      <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
      <li><a href="http://kathack.com/">Kathack</a></li>
      <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
   </ul>
</div>

↑↑↑ All li elements are returned as a list

Showing the li's text as a string

# The text of the first <li> element
page.css('li')[0].text

<div id="funstuff">
   <p>Here are some entertaining links:</p>
   <ul>
      <li><a href="http://youtube.com">YouTube</a></li>
      <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
      <li><a href="http://kathack.com/">Kathack</a></li>
      <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
   </ul>
   </div>

↑↑↑ "YouTube" is returned

Getting an attribute of the link in the first li

# The url of the link inside the first <li> element
page.css('li a')[0]['href']

<div id="funstuff">
<p>Here are some entertaining links:</p>
<ul>
   <li><a href="http://youtube.com">YouTube</a></li>
   <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
   <li><a href="http://kathack.com/">Kathack</a></li>
   <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
</ul>
</div>

↑↑↑ "http://youtube.com" is returned

Getting the a elements with data-category=news

# The <a> elements with a data-category of news
page.css("a[data-category='news']")

<div id="funstuff">
   <p>Here are some entertaining links:</p>
   <ul>
      <li><a href="http://youtube.com">YouTube</a></li>
      <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
      <li><a href="http://kathack.com/">Kathack</a></li>
      <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
   </ul>
   </div>

↑↑↑ The a elements with data-category="news" (Reddit and New York Times, in the 2nd and 4th li) are returned

Getting the div element whose id is funstuff

# The <div> element with an id of "funstuff"
page.css('div#funstuff')[0]

<h1>Hello Webpage!</h1>
<div id="references">
   <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
   <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
   <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
</div>

<div id="funstuff">
   <p>Here are some entertaining links:</p>
   <ul>
      <li><a href="http://youtube.com">YouTube</a></li>
      <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
      <li><a href="http://kathack.com/">Kathack</a></li>
      <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
   </ul>
</div>

↑↑↑ The div element whose id is funstuff is returned

Getting the a elements nested inside the div element whose id is references

# The <a> elements nested inside the <div> element that has an id of "references"
page.css('div#references a')

<h1>Hello Webpage!</h1>
<div id="references">
   <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
   <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
   <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
</div>
<div id="funstuff">
   <p>Here are some entertaining links:</p>
   <ul>
      <li><a href="http://youtube.com">YouTube</a></li>
      <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
      <li><a href="http://kathack.com/">Kathack</a></li>
      <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
   </ul>
</div>

↑↑↑ The a elements nested inside the div element whose id is references are returned

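As shown above, once you understand CSS selectors, you can pull out information cleanly. When you want an element's content as a string, append .text:

page.css('li')[1].text

This returns "Reddit". Attribute values such as ['href'] are already strings, so they don't need .text.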

That's it for my first programming post. If anything is unclear or you'd like more detail, send a reply to @hetare_dream.

Adieu
@ San Francisco, Silicon Valley

References:
Rubyで始めるWebスクレイピング
Nokogiri を使った Rubyスクレイピング [初心者向けチュートリアル] - 酒と泪とRubyとRailsと
