# Google Groups Scraper

![Build Status](https://secure.travis-ci.org/mcepl/gg_scraper.png)

A small script that replaces [the old PHP
script](http://saturnboy.com/2010/03/scraping-google-groups/) for
downloading messages stored in the black hole of Google Groups.

## How to use it?

This script requires `formail(1)` from the `procmail` package. Any
version will do, so install it from your distribution's repositories.
Then run:

    pip install beautifulsoup4 PyYAML
    python gg_scraper.py 'https://groups.google.com/forum/#!forum/<group_name>'
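
Since the script depends on an external `formail` binary, it can help to confirm that it is on your `PATH` before starting a long download. The following is only a sketch of such a check, not part of `gg_scraper.py`; the helper name `find_formail` is hypothetical.

```python
import shutil


def find_formail():
    """Return the full path to the formail executable, or None if it is not on PATH.

    Hypothetical helper for illustration; gg_scraper.py may check differently or not at all.
    """
    return shutil.which("formail")


if __name__ == "__main__":
    path = find_formail()
    if path is None:
        print("formail not found; install the procmail package first")
    else:
        print(f"formail found at {path}")
```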

## Background

I would never have started without the inspiration of the
[comment](https://luther.ceplovi.cz/blog/2013/09/19/we-should-stop-even-pretending-google-is-trying-to-do-the-right-thingtm/#comment-133-by-sean-hogan)
by Sean Hogan on my previous post about the locked-down nature of
Google Groups:

> At least google-groups appears to follow google's own advice on
> making AJAX sites accessible to the google web-crawler. See
> <https://developers.google.com/webmasters/ajax-crawling/docs/getting-started>
>
> So for me,
> <http://groups.google.com/d/forum/jbrout:FBv1oMXRZkxB6YShaFuNHc3-Moc&cuid=3654582>
> redirects (eventually) to
> <https://groups.google.com/forum/#!forum/jbrout> which can be viewed
> in raw HTML as
> <https://groups.google.com/forum/?_escaped_fragment_=forum/jbrout>
>
> Google Groups seems a perfect use-case for
> extreme-progressive-enhancement, but what would I know.
>
> regards, Sean
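
The URL mapping described in the quote (from the `#!` form to the `_escaped_fragment_` form) can be sketched in a few lines of Python. This is a minimal illustration assuming only the mapping shown above; the function name `escaped_fragment_url` is hypothetical and this is not necessarily how `gg_scraper.py` builds its URLs.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Rewrite a '#!' (hash-bang) URL into its '_escaped_fragment_' form.

    Hypothetical helper: URLs without a '#!' fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hash-bang URL, nothing to rewrite
    # Drop the leading '!' and percent-encode everything except '/'.
    fragment = quote(parts.fragment[1:], safe="/")
    token = "_escaped_fragment_=" + fragment
    # Append to an existing query string if there is one.
    query = parts.query + "&" + token if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(escaped_fragment_url("https://groups.google.com/forum/#!forum/jbrout"))
```

Run on the `jbrout` URL from the quote, this prints the same raw HTML address that Sean gives.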

------------------------------------------------------------------------

All issues, questions, complaints, or (even
better!) patches should be sent via email to the
[~mcepl/devel@lists.sr.ht](mailto:~mcepl/devel@lists.sr.ht)
mailing list (for patches use [git
send-email](https://git-send-email.io/)). For issue tracking
I use [git-bug](https://github.com/MichaelMure/git-bug) in this
repo.