Internships

Data Pipeline

Below are the details of the "Data Pipeline" internship at Western Digital.

Details
Name: Data Pipeline
Company: Western Digital
Description:

Implement an ActiveScale “Data Pipeline”

Description
Every time an object is written, the system publishes a message to an Apache Kafka message queue (https://kafka.apache.org/). Further processing can then happen asynchronously on the freshly written data. This mechanism is called the “Data Pipeline” and is a key building block in modern big data architectures.
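To make the asynchronous flow concrete, here is a minimal Python sketch. It is an illustration only: a standard-library queue stands in for the Kafka topic, and the event fields (bucket, key, size) as well as the function names are hypothetical, since the actual message format is not specified here.

```python
import json
import queue
import threading

# Stand-in for a Kafka topic; a real pipeline would use a Kafka client instead.
events = queue.Queue()

def put_object(store, bucket, key, data):
    """Write an object, then publish an event describing it (hypothetical schema)."""
    store[(bucket, key)] = data
    events.put(json.dumps({"bucket": bucket, "key": key, "size": len(data)}))

def worker(results):
    """Consume events until a None sentinel arrives, independently of the writer."""
    while True:
        raw = events.get()
        if raw is None:
            break
        event = json.loads(raw)
        # Post-processing (resize, transcode, index, ...) would happen here.
        results.append((event["key"], event["size"]))

def run_demo():
    """Write two objects and return what the consumer saw."""
    store, results = {}, []
    consumer = threading.Thread(target=worker, args=(results,))
    consumer.start()
    put_object(store, "photos", "cat.jpg", b"\xff\xd8\xff")
    put_object(store, "photos", "dog.jpg", b"\xff\xd8\xff\xe0")
    events.put(None)  # signal end of stream for the demo
    consumer.join()
    return results
```

The writer never waits for the post-processing step; it only enqueues an event, which is the property the Data Pipeline relies on.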

The goal of the internship is to set up a Kafka cluster and a stream processing framework to produce a suite of interesting post-processing applications. Depending on your interests and prior experience, we can choose from the following areas:
• Images
  o Uploaded images could be resized, auto-enhanced, filtered, … The resulting artifacts would be re-uploaded to the object store as auxiliary objects.
  o Feed images to an image recognition algorithm (self-written or in the cloud) to categorize, tag, … the content and push the results to an external database / tool.
• Video
  o Transcode, post-process, … uploaded video and re-upload it as an additional object.
  o Feed the audio to a speech recognition algorithm (self-written or in the cloud) to auto-generate subtitles/transcripts.
• Metrics
  o Compute and visualize system statistics (averages, histograms, percentiles, …) and metrics on object name and data size, object lifetime, capacity use per bucket, …
• Blockchain
  o Object name + MD5 checksum could be fed to a blockchain / Merkle tree to provide a form of ‘digital notarization’.
• Other
  o If you’re passionate about another interesting application, that’s even better.
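As an illustration of the Blockchain area, the following Python sketch folds (object name, MD5 checksum) pairs into a single Merkle root that could serve as a notarization fingerprint. The leaf encoding, the use of SHA-256 for tree nodes, and the function names are assumptions made for this example, not a prescribed design.

```python
import hashlib

def leaf_hash(name, md5_hex):
    """Hash one (object name, MD5 checksum) pair into a leaf."""
    return hashlib.sha256(f"{name}:{md5_hex}".encode("utf-8")).digest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise into one root; an odd node is paired with itself."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def notarize(objects):
    """Return the hex Merkle root for a dict of object name -> content bytes."""
    # Sorting by name makes the root independent of insertion order.
    leaves = [leaf_hash(name, hashlib.md5(data).hexdigest())
              for name, data in sorted(objects.items())]
    return merkle_root(leaves).hex()
```

Changing a single byte of any object changes the root, so publishing only the root is enough to later show that a set of objects existed in a given state.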

You will become familiar with cloud industry protocols such as the Amazon S3 API and with open-source projects (Apache Kafka, stream processing), and you will gain valuable experience in coding, prototyping and debugging distributed and cloud-based applications.
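For the Metrics area listed above, a first prototype could be as small as the Python sketch below. It assumes each pipeline event has already been decoded into a dict with hypothetical "bucket" and "size" fields, and that at least two events are available; a real stream processor would compute these values incrementally.

```python
import statistics
from collections import defaultdict

def summarize(events):
    """Compute basic size statistics and capacity use per bucket.

    `events` is a list of dicts with hypothetical keys "bucket" and "size";
    at least two events are required for the percentile computation.
    """
    sizes = [e["size"] for e in events]
    per_bucket = defaultdict(int)
    for e in events:
        per_bucket[e["bucket"]] += e["size"]
    return {
        "count": len(sizes),
        "mean_size": statistics.mean(sizes),
        "median_size": statistics.median(sizes),
        # Last of the nine decile cut points, i.e. the 90th percentile.
        "p90_size": statistics.quantiles(sizes, n=10, method="inclusive")[-1],
        "bytes_per_bucket": dict(per_bucket),
    }
```

The resulting dictionary can be pushed to a dashboard or time-series database of your choice for visualization.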

Technology
• Programming language of your choice: Python, Java, C++, Go, …
• AWS S3 API
• Apache Kafka

Goal
Create a demo that we can show to our customers to demonstrate the Data Pipeline.

Target profiles:
  • Master of Science in Engineering (Burgerlijk Ingenieur) - Computer Science Engineering
  • Computer Science
In industries:
  • Technology
  • IT
Required special knowledge:

Duration: 6 weeks
Paid: No
Net wage: -
Foreign: No
Contact: Olivier Gustin (HR Manager)
Email: recruiter@amplidata.com
Tel: