Class: WaybackArchiver::URLCollector

Inherits:
Object
Defined in:
lib/wayback_archiver/url_collector.rb

Overview

Retrieve URLs from different sources

Class Method Summary

Class Method Details

.crawl(url, hosts: [], limit: WaybackArchiver.max_limit) ⇒ Array<String>

Retrieve URLs by crawling.

Examples:

Crawl URLs defined on example.com

URLCollector.crawl('http://example.com')

Crawl URLs defined on example.com and limit the number of visited pages to 100

URLCollector.crawl('http://example.com', limit: 100)

Crawl URLs defined on example.com with explicitly no upper limit on the number of visited pages

URLCollector.crawl('http://example.com', limit: -1)

Crawl multiple hosts

URLCollector.crawl(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)
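
Since hosts: accepts Regexp entries, it can help to check how such a pattern matches candidate hostnames. This is plain Ruby and needs no part of the gem:

```ruby
pattern = /host[\d]+\.example\.com/

pattern.match?('host1.example.com')   # => true
pattern.match?('host42.example.com')  # => true
pattern.match?('www.example.com')     # => false
```

Note that the pattern is unanchored, so it would also match a longer host such as 'host1.example.com.evil.org'; anchor it with \A and \z if exact host matching is required.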

Parameters:

  • url (String)

    URL to start crawling from.

  • hosts (Array<String, Regexp>) (defaults to: [])

    hosts to crawl; strings and regexps are matched against each page's host.

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    maximum number of pages to visit; pass -1 for no limit.

Returns:

  • (Array<String>)

    of URLs found during the crawl.

Yields:

  • (page_url)

    each found URL, if a block is given.



# File 'lib/wayback_archiver/url_collector.rb', line 37

def self.crawl(url, hosts: [], limit: WaybackArchiver.max_limit)
  urls = []
  start_at_url = Request.build_uri(url).to_s
  options = {
    robots: WaybackArchiver.respect_robots_txt,
    hosts: hosts,
    user_agent: WaybackArchiver.user_agent
  }
  options[:limit] = limit unless limit == -1

  Spidr.site(start_at_url, options) do |spider|
    spider.every_page do |page|
      page_url = page.url.to_s
      urls << page_url
      WaybackArchiver.logger.debug "Found: #{page_url}"
      yield(page_url) if block_given?
    end
  end
  urls
end
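
The -1 sentinel works by omitting the :limit key from the Spidr options entirely, so Spidr crawls without a page cap. A minimal plain-Ruby sketch of that option-building step (build_crawl_options is a hypothetical helper, not part of the gem):

```ruby
# Hypothetical helper mirroring how .crawl assembles its Spidr options:
# -1 means "no limit", so the :limit key is simply left out.
def build_crawl_options(hosts: [], limit: -1)
  options = { hosts: hosts }
  options[:limit] = limit unless limit == -1
  options
end

build_crawl_options(limit: 100) # => { hosts: [], limit: 100 }
build_crawl_options(limit: -1)  # => { hosts: [] }
```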

.sitemap(url) ⇒ Array<String>

Retrieve URLs from Sitemap.

Examples:

Get URLs defined in Sitemap for google.com

URLCollector.sitemap('https://google.com/sitemap.xml')

Parameters:

  • url (String)

    URL of the Sitemap to retrieve.

Returns:

  • (Array<String>)

    of URLs defined in Sitemap.



# File 'lib/wayback_archiver/url_collector.rb', line 15

def self.sitemap(url)
  Sitemapper.urls(url: Request.build_uri(url))
end