Class: WaybackArchiver::Archive

Inherits:
Object
Defined in:
lib/wayback_archiver/archive.rb

Overview

Post URL(s) to Wayback Machine

Class Method Details

.crawl(source, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit) {|archive_result| ... } ⇒ Array<ArchiveResult>

Send URLs to Wayback Machine by crawling the site.

Examples:

Crawl example.com and send all URLs of the same domain

Archive.crawl('example.com')
Archive.crawl('example.com') do |result|
  puts [result.code || 'error', result.url] # print response status and URL
end

Crawl example.com and send all URLs of the same domain with low concurrency

Archive.crawl('example.com', concurrency: 1)

Stop after archiving 100 links

Archive.crawl('example.com', limit: 100)

Crawl multiple hosts

Archive.crawl(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)

Parameters:

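  • source (String)

    the URL to start crawling from.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

    the maximum number of parallel threads used to post URLs (the default is 1).

  • hosts (Array<String, Regexp>) (defaults to: [])

    the hosts to crawl.

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    the maximum number of URLs to archive.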

Yields:

  • (archive_result)

    If a block is given, each result is yielded to it, including errored results.

Yield Parameters:

  • archive_result (ArchiveResult)

    the result of posting a single URL.

Returns:

  • (Array<ArchiveResult>)

    the results for URLs posted to the Wayback Machine; errored results are omitted.



# File 'lib/wayback_archiver/archive.rb', line 78

def self.crawl(source, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit)
  WaybackArchiver.logger.info "Requests are sent with up to #{concurrency} parallel threads"

  posted_urls = Concurrent::Array.new
  pool = ThreadPool.build(concurrency)

  found_urls = URLCollector.crawl(source, hosts: hosts, limit: limit) do |url|
    pool.post do
      result = post_url(url)
      yield(result) if block_given?
      posted_urls << result unless result.errored?
    end
  end
  WaybackArchiver.logger.info "Crawling of #{source} finished, found #{found_urls.length} URL(s)"
  pool.shutdown
  pool.wait_for_termination

  WaybackArchiver.logger.info "#{posted_urls.length} URL(s) posted to Wayback Machine"
  posted_urls
end
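
The method body above follows a common pattern: each URL is handed to a bounded set of worker threads, every result is yielded to the caller's block, and only non-errored results end up in the returned array. The following is a minimal sketch of that pattern using only Ruby's standard library (Thread, Queue, Mutex) in place of the gem's ThreadPool and Concurrent::Array. FakeResult, post_all and fake_poster are hypothetical names introduced for illustration; they are not part of WaybackArchiver.

```ruby
# Hypothetical stand-in for ArchiveResult; only the fields this sketch needs.
FakeResult = Struct.new(:url, :code, :error) do
  def errored?
    !error.nil?
  end
end

# Posts every URL using `concurrency` worker threads.
# Yields each result (errored or not) and returns only the successful ones.
def post_all(urls, concurrency:, poster:)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once drained, #pop returns nil and the workers stop

  posted = []
  lock = Mutex.new # assumption of this sketch: serialize the block and the array

  workers = Array.new(concurrency) do
    Thread.new do
      while (url = queue.pop)
        result = poster.call(url)
        lock.synchronize do
          yield(result) if block_given?
          posted << result unless result.errored?
        end
      end
    end
  end

  workers.each(&:join)
  posted
end

# A fake poster that never touches the network.
fake_poster = lambda do |url|
  if url.include?('broken')
    FakeResult.new(url, nil, 'simulated failure')
  else
    FakeResult.new(url, '200', nil)
  end
end

seen = []
posted = post_all(
  ['http://example.com', 'http://example.com/broken', 'http://example.com/about'],
  concurrency: 2,
  poster: fake_poster
) { |result| seen << result.url }

puts "yielded #{seen.length}, returned #{posted.length}" # yielded 3, returned 2
```

The block sees all three results, while the return value holds only the two that did not error. The order in which results arrive depends on thread scheduling, so callers should not rely on it.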

.post(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit) {|archive_result| ... } ⇒ Array<ArchiveResult>

Send URLs to Wayback Machine.

Examples:

Archive URLs asynchronously

Archive.post(['http://example.com'])
Archive.post(['http://example.com']) do |result|
  puts [result.code || 'error', result.url] # print response status and URL
end

Archive URLs using only 1 thread

Archive.post(['http://example.com'], concurrency: 1)

Stop after archiving 100 links

Archive.post(['http://example.com'], limit: 100)

Explicitly set no limit on how many links are posted

Archive.post(['http://example.com'], limit: -1)

Parameters:

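  • urls (Array<String>)

    the URLs to send to the Wayback Machine.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

    the maximum number of parallel threads used to post URLs (the default is 1).

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    the maximum number of URLs to post; -1 means no limit.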

Yields:

  • (archive_result)

    If a block is given, each result is yielded to it, including errored results.

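Yield Parameters:

  • archive_result (ArchiveResult)

    the result of posting a single URL.

Returns:

  • (Array<ArchiveResult>)

    the results for URLs posted to the Wayback Machine; errored results are omitted.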



# File 'lib/wayback_archiver/archive.rb', line 26

def self.post(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit)
  WaybackArchiver.logger.info "Total URLs to be sent: #{urls.length}"
  WaybackArchiver.logger.info "Requests are sent with up to #{concurrency} parallel threads"

  urls_queue = if limit == -1
                 urls
               else
                 urls[0...limit]
               end

  posted_urls = Concurrent::Array.new
  pool = ThreadPool.build(concurrency)

  urls_queue.each do |url|
    pool.post do
      result = post_url(url)
      yield(result) if block_given?
      posted_urls << result unless result.errored?
    end
  end

  pool.shutdown
  pool.wait_for_termination

  WaybackArchiver.logger.info "#{posted_urls.length} URL(s) posted to Wayback Machine"
  posted_urls
end
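
The limit handling in the method body above can be tried in isolation. This is a small sketch that copies only the slicing logic; apply_limit is a hypothetical helper name, not a method of the gem.

```ruby
# Mirrors the limit handling shown in the method body above:
# -1 keeps every URL, any other value slices from the front of the list.
def apply_limit(urls, limit)
  limit == -1 ? urls : urls[0...limit]
end

urls = %w[http://example.com/a http://example.com/b http://example.com/c]

p apply_limit(urls, 2)   # the first two URLs
p apply_limit(urls, -1)  # all three URLs
p apply_limit(urls, 10)  # all three URLs; a limit above the length is harmless
p apply_limit(urls, 0)   # an empty array
```

Only -1 is treated specially. Any other negative value goes through Ruby's ordinary range slicing, which counts from the end of the array, so it is safest to pass either -1 or a non-negative limit.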

.post_url(url) ⇒ ArchiveResult

Send URL to Wayback Machine.

Examples:

Archive example.com, with default options

Archive.post_url('http://example.com')

Parameters:

  • url (String)

    to send.

Returns:

  • (ArchiveResult)



# File 'lib/wayback_archiver/archive.rb', line 104

def self.post_url(url)
  WaybackArchiver.adapter.call(url)
end
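
Because post_url only calls #call on whatever WaybackArchiver.adapter returns, any callable object can in principle stand in for the real adapter. The sketch below reproduces that delegation in a hypothetical MiniArchiver module so it runs without the gem or a network connection. How the real adapter is configured is not shown on this page, so the attr_accessor here is an assumption of the sketch.

```ruby
# Hypothetical miniature of the delegation shown above; not the gem's code.
module MiniArchiver
  class << self
    # Any object that responds to #call(url) can act as the adapter.
    attr_accessor :adapter
  end

  def self.post_url(url)
    adapter.call(url)
  end
end

# A recording adapter that never touches the network, handy in tests.
calls = []
MiniArchiver.adapter = lambda do |url|
  calls << url
  { url: url, code: '200' } # plain Hash used as a stand-in for an ArchiveResult
end

result = MiniArchiver.post_url('http://example.com')
puts result[:code]
puts calls.inspect
```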