Initial Fishbowl Discussion

From Data-Intensive Collaboration in Science and Engineering

CI of social science (Cogburn):

  • The movement of practices centered on younger scholars, who have the flexibility to explore new methodologies: (Andrea)

Norms of Data-/Code- Sharing (Steph):

  • i-Dash: NLP for medical records (Mary Chapman)
  • Is LDAP (“corporate infrastructure”) too rigid?
  • What do access rights around data tell us about social phenomena?
  • Prioritization/agency
  • Emerging paradigms encoded in technological change: how do we resolve the tensions of interdisciplinary work w/ varying values?
  • Valued data needs to be foregrounded for real-time research: is this part of the open/closed tension, or is it orthogonal to the tension of developing norms in interdisciplinary convergence?
  • The “new” NSF data management plan and its implications
  • Sharing of code/software:
  • SBGrid (Amber) gathering information about software usage including time and location information
  • To develop standardized platforms/recommendations to eliminate duplication of effort
  • Incentivize sustainability and maintenance by developers
  • Limited by what is permissible in each domain: potential venue for visualizations that depict high-level use at run-time (visualize the complex workflow and emerging paradigms over time)
  • The “scientist developer” (Jeff) does not necessarily have incentive to do this! Almost reverse incentive to train a doctoral student because they may not finish and become an engineer!
  • Do citation counts alter the incentive environment?

CI-enabled research projects vs CI centers (Joe):

  • What is the role of the centers?
  • Rhetoric/marketing of CI centers and attempting to attract social science/humanities.
  • Persistence of centers vs transient nature of research projects
  • Stewards “at the bottom of the stack” vs “at the top of the stack”
  • Move away from the term “supercomputing center” because the breadth of their work has expanded (Charlotte/Matt’s work w SDSC)
  • Switch of centers to service providers (“service dominant logic”)
  • Business value: looking at the corporate documents

Who is involved in the infrastructure? (Kirk)

  • Focus on the use of the infrastructure: the people outside the group that built it/ those who adopted it: Trust/Reliability/Provenance around sharing tools - “would rather write my own analysis routine than trust another tool that may have not been built to do specifically my set of tasks”
  • Policy makers, computational scientists, domain scientists, etc...
  • Pockets of activity that may not be connected with each other, but are routines that sometimes sustain, become adopted, and from which an infrastructure may emerge
  • Have projects that promote sustainability and broader impacts actually done that?
  • Resources endogenously connected to the collaboration: the value space of the endogenous project

Interaction of eScience Centers and those developing infrastructures (Bill):

  • Could we convert the library to the eScience Centers and provide solutions for modern research? (“That beautiful building in the center of campuses”)
  • Librarians have the skills to tackle some of these issues
  • While the stacks are going away, the traditional library services are not
  • Licensing, collection development, indexing
  • Infrastructural evaluation as the handmaiden of science (they do not consider themselves handmaidens, they are integral parts of research teams: written into grants)
  • Librarians in non-traditional spaces (Betsy’s biomedical research)
  • Cyberinfrastructure facilitators (Cogburn): data management + collaboration + communication = How do you train people to do all of this?
  • Venture capitalists & “incubators”: small funding to an undergrad-type (ex: Y-Combinator)
  • “How do you succeed? Fail faster.”
  • How can we incentivize or develop a model that increases this dialog?
  • eScience vs iPlant vs “Data-Intensive Science Center” vs “Data Science” vs cyberinfrastructure
  • “If you want to see the metaphor of the hornets’ nest play out... you should approach librarians about how to change their library” - Joe
  • The “clash” of domain scientists and computer scientists
  • eScience doesn’t fit into a traditional department and tends to not have a home
  • ad hoc nature of developing infrastructure for a single project: 40 different languages in one workflow

The benefits of introducing computer science into hard science (Khalid):

  • Adoption of agile processes in development
  • The EU framework funding mechanism for computer science is different from US: difference in demonstration of novelty vs function
  • Research Object Management: data sets, annotations, results, provenance of results, and the conclusions themselves captured in a single object
  • Best practices and techniques that can be used to “repair” a research object

Bringing in the perspective of the citizen scientist (Dana):

  • Facilitating complex tasks otherwise insurmountable by an individual lab because of constraints like funding, time, storage, etc
  • Retention and reuse of data can be facilitated by a content-creation community of volunteers, which fosters community norms
  • Encyclopedia of Life: documents and archives all species on Earth
  • Includes domain scientists, but many contributions (data, observations, etc.) come from volunteers and are later vetted by professionals before being added to the Encyclopedia
  • Design of crowdsourcing systems:
    • The technological barrier of crowdsourcing that maintains validity and vetting process, and also curates the content and engages the volunteer
    • Social issues of practices, norms, convergence of different populations
    • Design ideas for supporting crowdsourced content creation in science
    • Feedback and training? What is the cost of training and who bears the cost?
    • These issues are entirely context-specific: there is not a cure-all

Data re-use by Post-Docs in Epidemiologic Science (Betsy):

  • Context of the data is never captured solely by the metadata; interpersonal communication is needed to determine the validity/quality of the data
  • “Can data actually be reused when there is no connection between the data collection and the research project?”
  • Does open data remove the knowledge of who is using the data and how?
  • Authorship generally follows data collection
  • Efficiency? “If I’m sharing my data with 30 people, I’m getting 30 people asking me questions.”
  • To share or not to share data: Bietz & Birnholtz - risks and rewards evaluated on a case-by-case basis

Has humanities solved these big data problems before? (Ben):

  • Observe, store and retrieve data: “Are we simply extending the spreadsheet to the point where it breaks?” “Are we sacrificing what that data was supposed to inform?”

How does this play in the corporate world? (Mark):

  • Boeing is closest in structure to drug development and nuclear power plants
  • A lot of tiny applications (~1500 official, plus unofficial): all of which government regulation really wants to know about for safety! Build/buy decision. “A week of coding saves an hour of research” (Bill)
  • Agency & Power in corporate research:
    • Funding model: one grant!
    • Very top-down “go build this” structure
    • Very clear distinction between IT (cost center) and business (profit center + power)
    • Would have taken 10 mil to build it in-house but spent 100 mil to buy it
  • Data and instrument sharing: how do you define rights and accesses in corporate environments, particularly when there is corporate turnover and company acquisition
  • “Data is dirty everywhere.”