Computational Snowball Sampling as Methodological Parallel to Interview-Based Network Discovery
The universal scraper’s user discovery architecture (`scripts/universal_reddit_scraper.py`) implements snowball sampling parallel to qualitative interview methodology. Phase 1 user discovery (lines 283-336) extracts usernames from recent submissions, top posts, and controversial content to establish seed participants. Phase 1.2 network expansion (lines 338-391) performs iterative referral-chain discovery: `expand_user_network()` examines each user’s submission history (`limit=100`), extracts all commenters from those threads, records the discovery method as `network_expansion_{source_username}` (line 376), and adds the new users to the processing queue. The three-iteration expansion loop (`max_iterations: int = 3`) follows snowball interview protocols in which respondents identify peers who in turn identify further contacts, with each iteration documented through the `discovery_method` database field so that referral chains can be traced back to seed users.
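The iterative pattern can be sketched as follows; this is a minimal illustration assuming a PRAW-style client, not the scraper’s actual implementation, and the loop body condenses the batching, database writes, and checkpointing described below:

```python
import time

def expand_user_network(reddit, seed_users, max_iterations: int = 3,
                        rate_limit: float = 1.0):
    """Snowball sketch: commenters on each user's threads become the
    next iteration's frontier, tagged with their referral source."""
    discovered = {u: "subreddit_post" for u in seed_users}  # seed genealogy tags
    frontier = set(seed_users)
    for _ in range(max_iterations):
        batch_users = {}
        for username in frontier:
            redditor = reddit.redditor(username)
            for submission in redditor.submissions.new(limit=100):
                submission.comments.replace_more(limit=0)  # flatten comment tree
                for comment in submission.comments.list():
                    author = comment.author
                    if author and author.name not in discovered:
                        # record who referred us to this user
                        batch_users[author.name] = f"network_expansion_{username}"
            time.sleep(rate_limit)  # per-user throttle (cf. line 378)
        discovered.update(batch_users)
        frontier = set(batch_users)  # next snowball iteration
    return discovered
```

Each `network_expansion_{username}` value plays the role of the interviewer’s note that this contact was named by that respondent.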
Both sampling methods share logic and limitations. Interview-based snowball sampling risks homophily bias, where similar individuals refer similar contacts; computational network expansion likewise concentrates discovery within interaction clusters where commenters engage each other’s posts, undersampling isolated or peripheral members. The scraper’s `batch_users.add(comment.author.name)` operation (line 375) parallels ethnographers noting “participant X mentioned colleague Y” in field notes, with the database `users` table serving as a contact log that tracks referral source and processing status. The rate-limiting configuration (`self.rate_limits['user_discovery']: 1.0` seconds, line 378) introduces a temporal constraint absent from human interviews while serving a protective function: preventing platform detection corresponds to maintaining research ethics protocols. The implementation preserves referral chains through the `discovery_method` field, allowing analysis of user network propagation: initial seeds are tagged as `subreddit_post`, first-degree contacts as `network_expansion_{seed_username}`, creating a genealogy that enables questions like “what percentage of the corpus traces to which seed users?”
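That attribution question can be answered directly from the `discovery_method` field; a minimal sketch, assuming the `users` table lives in SQLite with a `discovery_method` column (the storage backend is inferred from the narrative, not confirmed against the schema):

```python
import sqlite3

def seed_attribution(db_path: str) -> dict[str, float]:
    """Percentage of the corpus attributable to each discovery method."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT discovery_method, COUNT(*) FROM users GROUP BY discovery_method"
    ).fetchall()
    conn.close()
    total = sum(count for _, count in rows)
    # 'subreddit_post' rows are seeds; 'network_expansion_<seed>' rows
    # attribute users to the referrer who surfaced them
    return {method: 100.0 * count / total for method, count in rows}
```

This yields only first-degree attribution; tracing a user all the way back to an original seed requires recursively resolving each `network_expansion_{source}` tag to that source’s own `discovery_method`.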
Traditional snowball sampling faces interviewer effects and participant gatekeeping; the computational version encounters API restrictions and account accessibility issues, where an `AttributeError` on `user.created_utc` (lines 362-366) excludes suspended, deleted, or privacy-protected accounts. The checkpoint system (`_save_checkpoint()`, line 390) enables resumption after interruption, with the `discovered_users`, `processed_users`, and `failed_users` sets (lines 77-79) tracking recruitment stages. The corpus of 24,270 CUNY users therefore represents a network sample rather than a random sample, in which inclusion probability correlates with interaction frequency and referral-chain proximity to seed users. Like their interview counterparts, snowball samples overrepresent socially central individuals while undersampling isolates, newcomers, and those with privacy restrictions or API blocks.
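The accessibility check and checkpoint bookkeeping can be sketched as follows; the function bodies and the JSON checkpoint format are illustrative assumptions, with only the attribute probe and the three set names taken from the implementation:

```python
import json

def is_accessible(redditor) -> bool:
    """PRAW redditors are lazy: attribute access triggers the API call,
    and inaccessible accounts surface here as an AttributeError."""
    try:
        _ = redditor.created_utc  # probe mirrors lines 362-366
        return True
    except AttributeError:
        return False

def save_checkpoint(path, discovered_users, processed_users, failed_users):
    """Persist the recruitment-stage sets so an interrupted run can resume."""
    with open(path, "w") as f:
        json.dump({
            "discovered": sorted(discovered_users),
            "processed": sorted(processed_users),
            "failed": sorted(failed_users),
        }, f)
```

Methodologically, `failed_users` is itself data: it enumerates exactly which contacts the referral chain named but could not reach.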
Evidence Base:
- Implementation: `scripts/universal_reddit_scraper.py` lines 283-391 (Phase 1 and 1.2 user discovery)
- Key Methods: `discover_users_phase1()` (initial seeds), `expand_user_network()` (iterative expansion), `_record_user()` (referral tracking)
- Database Architecture: `users` table with `discovery_method` field enabling referral chain reconstruction
- Corpus Scale: 24,270 unique CUNY users discovered through iterative network sampling
- Methodological Parallel: Qualitative snowball sampling translated to computational network traversal with preserved referral genealogy
Log Guidelines
This log documents the computational ethnography research process through evidence-anchored academic narrative. Each entry transforms data discoveries into flowing scholarly prose that:
- Integrates evidence seamlessly: Submission and comment IDs function as inline citations within narrative flow
- Preserves student voices: Vernacular expressions reveal structural conditions
- Connects individual to structural: Specific findings illuminate systemic patterns
- Employs theoretical frameworks: Natural integration of scholarly concepts
- Maintains temporal awareness: Patterns reveal institutional gaps
- Creates analytical bridges: Connects findings to emerging research questions
For technical implementations, use a structured format with sections for What Happened, Technical Details, Challenges Encountered, Solutions & Outcomes, Research Implications, and Next Steps.