Computational Snowball Sampling as Methodological Parallel to Interview-Based Network Discovery
The universal scraper’s user discovery architecture (`scripts/universal_reddit_scraper.py`) implements snowball sampling parallel to qualitative interview methodology. Phase 1 user discovery (lines 283-336) extracts usernames from recent submissions, top posts, and controversial content to establish seed participants. Phase 1.2 network expansion (lines 338-391) performs iterative referral-chain discovery: `expand_user_network()` examines each user’s submission history (`limit=100`), extracts all commenters from those threads, records the discovery method as `network_expansion_{source_username}` (line 376), and adds the new users to the processing queue. The three-iteration expansion loop (`max_iterations: int = 3`) follows snowball interview protocols in which respondents identify peers who in turn identify further contacts, with each iteration documented through the `discovery_method` database field so that referral chains can be traced back to seed users.
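The iterative pattern can be sketched as follows; this is a minimal illustration assuming a PRAW-style client, not the scraper’s actual implementation, and the loop body condenses the batching, database writes, and checkpointing described below:

```python
import time

def expand_user_network(reddit, seed_users, max_iterations: int = 3,
                        rate_limit: float = 1.0):
    """Snowball sketch: commenters on each user's threads become the
    next iteration's frontier, tagged with their referral source."""
    discovered = {u: "subreddit_post" for u in seed_users}  # seed genealogy tags
    frontier = set(seed_users)
    for _ in range(max_iterations):
        batch_users = {}
        for username in frontier:
            redditor = reddit.redditor(username)
            for submission in redditor.submissions.new(limit=100):
                submission.comments.replace_more(limit=0)  # flatten comment tree
                for comment in submission.comments.list():
                    author = comment.author
                    if author and author.name not in discovered:
                        # record who referred us to this user
                        batch_users[author.name] = f"network_expansion_{username}"
            time.sleep(rate_limit)  # per-user throttle (cf. line 378)
        discovered.update(batch_users)
        frontier = set(batch_users)  # next snowball iteration
    return discovered
```

Each `network_expansion_{username}` value plays the role of the interviewer’s note that this contact was named by that respondent.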
Both sampling methods share logic and limitations. Interview-based snowball sampling risks homophily bias, where similar individuals refer similar contacts; computational network expansion likewise concentrates discovery within interaction clusters where commenters engage each other’s posts, undersampling isolated or peripheral members. The scraper’s `batch_users.add(comment.author.name)` operation (line 375) parallels ethnographers noting “participant X mentioned colleague Y” in field notes, with the database `users` table serving as a contact log that tracks referral source and processing status. The rate-limiting configuration (`self.rate_limits['user_discovery']: 1.0` seconds, line 378) introduces a temporal constraint absent from human interviews while serving a protective function: preventing platform detection corresponds to maintaining research ethics protocols. The implementation preserves referral chains through the `discovery_method` field, allowing analysis of user network propagation: initial seeds are tagged as `subreddit_post`, first-degree contacts as `network_expansion_{seed_username}`, creating a genealogy that enables questions like “what percentage of the corpus traces to which seed users?”
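That attribution question can be answered directly from the `discovery_method` field; a minimal sketch, assuming the `users` table lives in SQLite with a `discovery_method` column (the storage backend is inferred from the narrative, not confirmed against the schema):

```python
import sqlite3

def seed_attribution(db_path: str) -> dict[str, float]:
    """Percentage of the corpus attributable to each discovery method."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT discovery_method, COUNT(*) FROM users GROUP BY discovery_method"
    ).fetchall()
    conn.close()
    total = sum(count for _, count in rows)
    # 'subreddit_post' rows are seeds; 'network_expansion_<seed>' rows
    # attribute users to the referrer who surfaced them
    return {method: 100.0 * count / total for method, count in rows}
```

This yields only first-degree attribution; tracing a user all the way back to an original seed requires recursively resolving each `network_expansion_{source}` tag to that source’s own `discovery_method`.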
Traditional snowball sampling faces interviewer effects and participant gatekeeping; the computational version encounters API restrictions and account accessibility issues, where an `AttributeError` on `user.created_utc` (lines 362-366) excludes suspended, deleted, or privacy-protected accounts. The checkpoint system (`_save_checkpoint()`, line 390) enables resumption after interruption, with the `discovered_users`, `processed_users`, and `failed_users` sets (lines 77-79) tracking recruitment stages. The corpus of 24,270 CUNY users therefore represents a network sample rather than a random sample, in which inclusion probability correlates with interaction frequency and referral-chain proximity to seed users. Like their interview counterparts, snowball samples overrepresent socially central individuals while undersampling isolates, newcomers, and those with privacy restrictions or API blocks.
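The accessibility check and checkpoint bookkeeping can be sketched as follows; the function bodies and the JSON checkpoint format are illustrative assumptions, with only the attribute probe and the three set names taken from the implementation:

```python
import json

def is_accessible(redditor) -> bool:
    """PRAW redditors are lazy: attribute access triggers the API call,
    and inaccessible accounts surface here as an AttributeError."""
    try:
        _ = redditor.created_utc  # probe mirrors lines 362-366
        return True
    except AttributeError:
        return False

def save_checkpoint(path, discovered_users, processed_users, failed_users):
    """Persist the recruitment-stage sets so an interrupted run can resume."""
    with open(path, "w") as f:
        json.dump({
            "discovered": sorted(discovered_users),
            "processed": sorted(processed_users),
            "failed": sorted(failed_users),
        }, f)
```

Methodologically, `failed_users` is itself data: it enumerates exactly which contacts the referral chain named but could not reach.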
Evidence Base:
- Implementation: `scripts/universal_reddit_scraper.py` lines 283-391 (Phase 1 and 1.2 user discovery)
- Key Methods: `discover_users_phase1()` (initial seeds), `expand_user_network()` (iterative expansion), `_record_user()` (referral tracking)
- Database Architecture: `users` table with `discovery_method` field enabling referral chain reconstruction
- Corpus Scale: 24,270 unique CUNY users discovered through iterative network sampling
- Methodological Parallel: Qualitative snowball sampling translated to computational network traversal with preserved referral genealogy
Log Guidelines
This log documents the computational ethnography research process through evidence-anchored academic narrative. Each entry transforms data discoveries into flowing scholarly prose that:
- Integrates evidence seamlessly: Submission and comment IDs function as inline citations within narrative flow
- Preserves student voices: Vernacular expressions reveal structural conditions
- Connects individual to structural: Specific findings illuminate systemic patterns
- Employs theoretical frameworks: Natural integration of scholarly concepts
- Maintains temporal awareness: Patterns reveal institutional gaps
- Creates analytical bridges: Connects findings to emerging research questions
For technical implementations, use a structured format with sections for What Happened, Technical Details, Challenges Encountered, Solutions & Outcomes, Research Implications, and Next Steps.