priya-gitTest/JOSS_SoftwareRepositoryExtractor

🚀 JOSS + Helmholtz(RSD) Software Repository Extractor

A Python tool that extracts and catalogs software repository URLs from papers published in JOSS and from the Helmholtz RSD

🎯 Features · 🚀 Quick Start · 📊 Output · 🛠️ Usage · 📈 Statistics


🎯 Features

Fast & Efficient

  • Scans all 3,100+ published papers in minutes (typically 2-5)
  • Rate-limited API calls to respect the server
  • Progress tracking with real-time updates

🎯 Smart Extraction

  • Filters out papers without repositories
  • Handles edge cases and malformed URLs
  • Comprehensive error handling

📊 Detailed Analytics

  • Processing vs output record counts
  • Repository coverage statistics
  • Data integrity verification

📁 Professional Output

  • Timestamped CSV files
  • Quoted URL format
  • UTF-8 encoding support

🚀 Quick Start

🐙 GitHub Codespaces (Recommended)

Open in GitHub Codespaces

# 1. Open in Codespaces (click the badge above)
# 2. Install uv
pip install uv
# 3. Create the virtual environment
uv venv
# 4. Activate the environment (Linux)
source .venv/bin/activate
# Run only if requirements.txt is present
uv pip install -r requirements.txt

5. Install Packages

# Run only if requirements.txt is absent
uv pip install requests beautifulsoup4

6. Run the JOSS extractor

python joss_extractor.py

7. Run the Helmholtz(RSD) extractor

python helmholtzRSD_extractor.py

📊 Output

Each script generates a timestamped CSV file with a single software_repository column, one quoted URL per row:

software_repository
"https://github.com/example/awesome-tool"
"https://gitlab.com/research/data-analyzer"
"https://codeberg.org/dev/ml-framework"
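
A minimal sketch of how output in this shape could be produced, assuming an unquoted header line followed by fully quoted rows. The function name `write_repository_csv` is hypothetical and is not taken from the project's scripts.

```python
import csv
import io


def write_repository_csv(urls, handle):
    """Write an unquoted header, then one fully quoted URL per row."""
    handle.write("software_repository\n")
    # QUOTE_ALL wraps every field in double quotes, matching the sample above.
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for url in urls:
        writer.writerow([url])


# Demonstration against an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_repository_csv(
    [
        "https://github.com/example/awesome-tool",
        "https://gitlab.com/research/data-analyzer",
    ],
    buffer,
)
print(buffer.getvalue(), end="")
```

When writing to disk, opening the file with `open(path, "w", newline="", encoding="utf-8")` keeps the line endings under the writer's control and preserves UTF-8 characters.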

📁 File Naming Convention

joss_repositories_YYYYMMDD_HHMMSS.csv
Helmholtz_software_repositories_YYYYMMDD_HHMMSS.csv

Example: joss_repositories_20250805_143022.csv
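
The naming convention can be reproduced with `datetime.strftime`. This is an illustrative sketch only; the helper name `timestamped_filename` is an assumption, not the project's actual function.

```python
from datetime import datetime


def timestamped_filename(prefix, now=None):
    """Build '<prefix>_YYYYMMDD_HHMMSS.csv' from the given (or current) time."""
    moment = now if now is not None else datetime.now()
    return f"{prefix}_{moment.strftime('%Y%m%d_%H%M%S')}.csv"


# Using the timestamp from the example above.
print(timestamped_filename("joss_repositories", datetime(2025, 8, 5, 14, 30, 22)))
# prints: joss_repositories_20250805_143022.csv
```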

🛠️ Usage

Basic Usage

python joss_extractor.py
python helmholtzRSD_extractor.py

Expected Output

🚀 JOSS Papers Data Extractor
==================================================
🕒 Started at: 2025-08-05 14:30:15

Fetching page 1/156...
  → Retrieved 20 papers (Total: 20)
Fetching page 2/156...
  → Retrieved 20 papers (Total: 40)
...

============================================================
📊 EXTRACTION SUMMARY
============================================================
📥 Total papers processed: 3,111
📝 Records written to CSV: 3,089
❌ Papers without repositories: 22
📈 Repository coverage: 99.3%
📁 Output file: joss_repositories_20250805_143022.csv
🕒 Extraction completed at: 2025-08-05 14:32:18

🔍 VERIFICATION:
✅ Processed 3,111 papers from API
✅ Wrote 3,089 repository URLs to CSV
✅ Data integrity: 3,089 + 22 = 3,111 ✓

⏱️ Total execution time: 123.4 seconds
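
The coverage and integrity figures in the summary follow from two counts. The sketch below recomputes them from the sample numbers; `summarize` is a hypothetical helper, and the script's own bookkeeping may differ.

```python
def summarize(total_processed, records_written):
    """Derive the missing count and coverage percentage from two counters."""
    missing = total_processed - records_written
    coverage = 100.0 * records_written / total_processed if total_processed else 0.0
    return {
        "processed": total_processed,
        "written": records_written,
        "missing": missing,
        "coverage_percent": round(coverage, 1),
    }


summary = summarize(3111, 3089)
# Integrity check: written + missing must equal processed.
assert summary["written"] + summary["missing"] == summary["processed"]
print(f"Papers without repositories: {summary['missing']}")    # 22
print(f"Repository coverage: {summary['coverage_percent']}%")  # 99.3%
```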

📈 Statistics

| Metric | Typical Value |
|---|---|
| Total Papers | ~3,100+ |
| Repository Coverage | ~99% |
| Execution Time | 2-5 minutes |
| Output Size | ~200KB |
| API Pages | ~156 pages |

🔧 Technical Details

Requirements

  • Python 3.6+
  • requests library (the Quick Start also installs beautifulsoup4)
  • Internet connection

API Details

  • Base URL: https://joss.theoj.org/papers/published.json
  • Pagination: 20 records per page
  • Total Pages: ~156 pages
  • Rate Limiting: 100ms delay between requests
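
A rough sketch of a paginated, rate-limited fetch loop built from the details above. Several things here are assumptions rather than confirmed behavior: that pages are selected with a `page` query parameter, that each page is a JSON list, and that an empty page marks the end. The function names are hypothetical.

```python
import time

BASE_URL = "https://joss.theoj.org/papers/published.json"


def fetch_page(page):
    """Fetch one page of papers (assumes a `page` query parameter)."""
    # Imported here so the loop below can be exercised without the dependency.
    import requests

    response = requests.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_all(fetch=fetch_page, delay=0.1, max_pages=1000):
    """Collect papers page by page, sleeping `delay` seconds between requests."""
    papers = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:  # assumption: an empty page means there is no more data
            break
        papers.extend(batch)
        print(f"Fetching page {page}: retrieved {len(batch)} papers (Total: {len(papers)})")
        time.sleep(delay)  # 100 ms by default, as listed above
    return papers
```

Passing the page fetcher in as a parameter lets the loop be tested against canned data, and `max_pages` guards against an endless loop if the end-of-data assumption turns out to be wrong.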

Data Processing

  1. Fetch all pages from JOSS API
  2. Filter papers with valid repository URLs
  3. Format URLs with explicit quotes
  4. Export to timestamped CSV file
  5. Verify data integrity
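
Step 2 could look roughly like the following. It assumes each paper is a dict whose repository URL sits under a `software_repository` key (mirroring the CSV column name) and treats only well-formed http(s) URLs as valid; the project's real validation rules may be stricter or looser.

```python
from urllib.parse import urlparse


def extract_repository_urls(papers, field="software_repository"):
    """Return (valid_urls, skipped_count) for a list of paper dicts."""
    urls = []
    skipped = 0
    for paper in papers:
        raw = paper.get(field)
        if not isinstance(raw, str):
            skipped += 1
            continue
        candidate = raw.strip()
        try:
            parsed = urlparse(candidate)
        except ValueError:  # raised for some malformed URLs, e.g. bad IPv6 hosts
            skipped += 1
            continue
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
        else:
            skipped += 1
    return urls, skipped


sample = [
    {"software_repository": "https://github.com/example/awesome-tool"},
    {"software_repository": ""},
    {"title": "paper without a repository field"},
]
urls, skipped = extract_repository_urls(sample)
print(urls, skipped)
```

Returning the skipped count alongside the URLs makes the integrity check in step 5 straightforward: the two numbers should add up to the number of papers processed.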

🤝 Contributing

This project was generated with the assistance of Claude AI. Contributions are welcome!

  1. Fork the repository

  2. Create your feature branch (git checkout -b feature/AmazingFeature)

  3. Commit your changes (git commit -m 'Add some AmazingFeature')

  4. Push to the branch (git push origin feature/AmazingFeature)

  5. Open a Pull Request

  6. [TODO: Fix license extraction logic for non-GitHub repositories]

📄 License

This project is open source and available under the MIT License.

🙏 Acknowledgments


Made with ❤️ and AI assistance
