fbcrawl

A Facebook web crawler written in Perl.
git clone git://seanh.sh/fbcrawl


# Proof of Concept: Facebook Scraper

This tool was built as a proof of concept to
demonstrate how easily data can be scraped
from Facebook's mobile webapp.

## Installation

### Dependencies

* Bash
* Perl
* POSIX
* A Perl DOM library (Mojo::DOM)

Install Mojo::DOM by running `cpan` in a terminal and executing `install Mojo::DOM`.
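
If you'd rather skip the interactive shell, the `cpan` client also accepts a
module name directly. A minimal sketch (Mojo::DOM ships as part of the
Mojolicious distribution, so that's what actually gets installed):

```
# Non-interactive install of Mojo::DOM (pulls in Mojolicious).
cpan Mojo::DOM

# Sanity check: exits non-zero if the module can't be loaded.
perl -MMojo::DOM -e 'print "Mojo::DOM ok\n"'
```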

## Instructions

Place your cookie header in a file somewhere, maybe /home/you/.fbcookie

.fbcookie:

```Cookie: datr=xxx; fr=xxx; sb=xxx; wd=xxx; c_user=xxx; xs=xxx; ...```

You can get your Facebook cookies by inspecting the request headers in
your browser's network tools panel during a request to Facebook, or in
your browser's settings page (google it for your browser).
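
If you want to confirm the cookie works before kicking off a crawl, a
hypothetical one-liner (not part of the repo's scripts; it assumes the mobile
site at mbasic.facebook.com and that logged-in pages contain a logout link):

```
# Send one authenticated request using the saved header line; a count
# of zero suggests the cookie isn't being accepted.
curl -s -H "$(cat ~/.fbcookie)" https://mbasic.facebook.com/ | grep -c logout
```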

Next, specify the crawler's start point as a *+username* (prepend *your.username*
with a '+') or an *ID* (this should be a number) by placing it in the `todo` file:

```echo +your.username > todo```

Finally, run the crawler like so:

```sh start.sh path/to/your/.fbcookie```

## Where does the data go?

Raw HTML from friends lists goes into `./tmp`

Empty files named *+username* or *ID* go into `./done` to mark a profile as *scraped*.

Scheduled profiles to be scraped are added (FIFO style) to `todo`.
This file starts out small (just the start point), grows fast, then starts to
shrink as you reach the edges of your extended network on Facebook (note that
the crawler doesn't scrape the profiles of non-friends who have no friends in
common with you).

`./names` is gradually filled with files named *+username* or *ID*, each
containing a single line with that user's name.

`./friends` is filled with directories for each user whose friends list is scraped.
The directories are named *+username* or *ID*, and contain files whose names
represent that user's friends. E.g. if I have N friends on Facebook,
`./friends/+my.username` would be populated with N empty files, each named with
the *ID* or *+username* of one of those friends.
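
Because everything is plain files, ordinary shell tools are enough to poke at
a crawl in progress. A few illustrative queries (the *+some.username* key is
hypothetical):

```
# How many profiles are queued vs. already scraped?
wc -l < todo
ls done | wc -l

# Look up the recorded display name for one profile.
cat 'names/+some.username'

# Count a user's scraped friends (one empty file per friend).
ls 'friends/+some.username' | wc -l
```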

## Cleanup

```sh clean.sh```

## Disclaimer

I wrote this program to be appreciated, not to be run. If you do run it
(which I don't recommend), expect to see some of your Facebook activity
limited after a few hours of running the script.

The source code is 140 lines. If you do run the script, read the source code;
common sense works better than code signatures.

## Why is the data stored in such a weird format?

I wrote this set of Perl and Bash scripts in an evening over a nice glass of
wine. Quick and dirty was my motto. Nowadays I use SQLite for quick-setup
storage while doing small things like this. At the time of writing these
scripts, my brain's JIT learning algorithm had not yet crossed paths with SQLite.

I used empty files in a folder to model a *set* data structure, non-empty files
in a folder to model a *map*, a file with multiple lines (trimmed from the top,
appended to at the bottom) to model a queue, and a folder of folders of files to
model a map of strings to sets; the equivalent shell operations are sketched below.
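
For the curious, here is roughly how those filesystem-backed structures map
onto shell operations (a sketch with made-up keys, not code from the repo):

```
# Set (empty files in a folder): insert and membership test.
touch 'done/+some.username'
[ -e 'done/+some.username' ] && echo 'already scraped'

# Map (non-empty files in a folder): put and get.
echo 'Some User' > 'names/+some.username'
cat 'names/+some.username'

# FIFO queue (trim from the top, append at the bottom).
next=$(head -n 1 todo)                          # peek at the front
tail -n +2 todo > todo.new && mv todo.new todo  # pop the front
echo '+another.username' >> todo                # push at the back
```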