=====================
Google Group Scraper
=====================
.. image:: https://secure.travis-ci.org/mcepl/gg_scraper.png
   :alt: Build Status

A small script that replaces `the old PHP script`_ for downloading messages stored in the black hole of Google Groups.

.. _`the old PHP script`:
   http://saturnboy.com/2010/03/scraping-google-groups/

How to use it?
--------------
This script requires ``formail(1)`` from the ``procmail`` package. Any
version will do, so please install it from your distribution’s
repositories. Then run:

::

    pip install beautifulsoup4 PyYAML
    python gg_scraper.py 'https://groups.google.com/forum/#!forum/<group_name>'

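
Since ``formail`` has to be installed separately, it can be worth checking for it before a long download starts. The following is a minimal sketch of such a check and is not part of ``gg_scraper.py``; the helper name ``missing_tools`` is made up for this example, and it relies on ``shutil.which``, so it assumes Python 3.3 or later:

::

    import shutil


    def missing_tools(tools=("formail",)):
        """Return the names from ``tools`` that cannot be found on PATH."""
        return [name for name in tools if shutil.which(name) is None]


    if __name__ == "__main__":
        missing = missing_tools()
        if missing:
            # formail ships with procmail in most distributions.
            print("Please install: " + ", ".join(missing))
        else:
            print("All required tools found.")
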
Background
----------
I would never have started without the inspiration of the comment_ by Sean Hogan on my previous post about the locked-down nature of Google Groups:

    At least google-groups appears to follow google's own advice on making AJAX sites accessible to the google web-crawler. See https://developers.google.com/webmasters/ajax-crawling/docs/getting-started

    So for me,

    http://groups.google.com/d/forum/jbrout:FBv1oMXRZkxB6YShaFuNHc3-Moc&cuid=3654582

    redirects (eventually) to

    https://groups.google.com/forum/#!forum/jbrout

    which can be viewed in raw HTML as

    https://groups.google.com/forum/?_escaped_fragment_=forum/jbrout

    Google Groups seems a perfect use-case for extreme-progressive-enhancement, but what would I know.

    regards,

    Sean

.. _comment:
   https://luther.ceplovi.cz/blog/2013/09/19/we-should-stop-even-pretending-google-is-trying-to-do-the-right-thingtm/#comment-133-by-sean-hogan

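
The rewrite Sean describes, from the ``#!`` form of a URL to its ``_escaped_fragment_`` form, can be sketched in a few lines of Python 3. This is only an illustration and not code taken from ``gg_scraper.py``; the function name ``escaped_fragment_url`` is made up for the example, and percent-escaping the fragment with ``quote`` is my assumption about what the crawling scheme expects:

::

    from urllib.parse import quote, urlsplit, urlunsplit


    def escaped_fragment_url(url):
        """Rewrite a ``#!`` URL into its ``_escaped_fragment_`` form."""
        parts = urlsplit(url)
        if not parts.fragment.startswith("!"):
            # Not a hashbang URL, so there is nothing to rewrite.
            return url
        # Drop the leading "!" and escape characters unsafe in a query value.
        value = quote(parts.fragment[1:], safe="/")
        param = "_escaped_fragment_=" + value
        # Keep any query string that was already present.
        query = parts.query + "&" + param if parts.query else param
        return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

For the group in the quotation above, ``escaped_fragment_url('https://groups.google.com/forum/#!forum/jbrout')`` gives the raw-HTML address Sean mentions.
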
Current bugs are filed in my bugzilla_, and new ones can be reported via
email (one of my many addresses is available on my `Github page`_).

.. _bugzilla:
   https://luther.ceplovi.cz/bugzilla/buglist.cgi?quicksearch=product%3Agg_scraper

.. _`Github page`:
   https://github.com/mcepl

Of course, pull requests are more than welcome in the same places.
Currently all development is done with Python 3.3, but tests are also
run on Travis-CI_ for 2.6, 2.7, PyPy, and 3.2.

.. _Travis-CI:
   https://travis-ci.org/mcepl/gg_scraper/