Usage

To use Python client for dataquality.pl in a project:

from dq import DQClient, JobConfig


dq = DQClient('https://app.dataquality.pl', user='<USER_EMAIL>', token='<API_TOKEN>')

API token can be obtain on the page “Moje konto”.

Account

Check account status:

account = dq.account_status()

print(account.email)          # user email
print(account.balance)        # account balance
print(account.total_records)  # processed records

Jobs

List jobs

jobs = dq.list_jobs()

for job in jobs:
    print(job.id)                # job id
    print(job.name)              # human readable job name
    print(job.status)            # job status
    print(job.start_date)        # job start date
    print(job.end_date)          # job end date
    print(job.source_records)    # how many records were applied
    print(job.processed_records) # how many records were processed
    print(job.price)             # price for processed records

Create new job

input_data = '''"ID","ADRES"
6876,"34-404, PYZÓWKA, PODHALAŃSKA 100"
'''

job_config = JobConfig('my job')
job_config.input_format(field_separator=',', text_delimiter='"', has_header=True)
job_config.input_column(0, name='ID', function='PRZEPISZ')
job_config.input_column(1, name='ADRES', function='DANE_OGOLNE')
job_config.module_std(address=1)
job_config.extend(gus=True, geocode=True)

job = dq.submit_job(job_config, input_data=input_data)                                         # with data in a variable

job = dq.submit_job(job_config, input_file='my_file.csv')                                      # with data inside file

print(job.id)
print(job.name)
print(job.status)
...

Create new deduplication job

input_data = '''unikalne_id;imie_i_nazwisko;kod_pocztowy;miejscowosc;adres;email;tel;CrmContactNumber;data
1;Jan Kowalski;37-611;Cieszanów ;Dachnów 189;abc@wp.pl;605936000;abc123;2017-11-08 12:00:00.000
2;Adam Mickiewicz Longchamps de Berier;66-400;Gorzów Wlkp.;Widok 24;qqq@ft.com;48602567000;a2b2c2;2017-11-08 12:00:00.000
3;Barbara Łęcka;76-200;Słupsk;Banacha 7;bb@gazeta.pl;79174000;emc2;2017-11-08 12:00:00.000
4;KAROL NOWAK;22-122;LEŚNIOWICE;RAKOLUPY DU—E 55;kn@ll.pp;0;f112358;2017-11-08 12:00:00.000
5;Anna Maria Jopek;34-722;Podwilk;Podwilk 464;amj@gmail.com;606394000;eipi10;2017-11-08 12:00:00.000
6;Mariusz Robert;37-611;Cieszanów ;Dachnów 189;abc@wp.pl;605936000;abc123;2017-11-08 12:00:00.000
'''

job_config = JobConfig('pr2')
job_config.input_format(field_separator=';', text_delimiter='"', has_header=True)
job_config.input_column(0, name='unikalne_id', function='ID_REKORDU')
job_config.input_column(1, name='imie_i_nazwisko', function='IMIE_I_NAZWISKO')
job_config.input_column(2, name='kod_pocztowy', function='KOD_POCZTOWY')
job_config.input_column(3, name='miejscowosc', function='MIEJSCOWOSC')
job_config.input_column(4, name='adres', function='ULICA_NUMER_DOMU_I_MIESZKANIA')
job_config.input_column(5, name='email', function='EMAIL1')
job_config.input_column(6, name='tel', function='TELEFON1')
job_config.input_column(7, name='CrmContactNumber', function='PRZEPISZ')
job_config.input_column(8, name='data', function='CZAS_AKTUALIZACJI')
job_config.deduplication(on=True)
job_config.module_std(address=True, names=True, contact=True)
job_config.extend(gus=True, geocode=True, diagnostic=True)

job = dq.submit_job(job_config, input_data=input_data)

print(job)
...

Available column functions:

  • addresses
    • KOD_POCZTOWY
    • MIEJSCOWOSC
    • ULICA_NUMER_DOMU_I_MIESZKANIA
    • ULICA
    • NUMER_DOMU
    • NUMER_MIESZKANIA
    • NUMER_DOMU_I_MIESZKANIA
    • WOJEWODZTWO
    • POWIAT
    • GMINA
  • names
    • IMIE
    • NAZWISKO
    • NAZWA_PODMIOTU
    • IMIE_I_NAZWISKO
  • people/companies
    • PESEL
    • NIP
    • REGON
  • contact
    • EMAIL1
    • EMAIL2
    • TELEFON1
    • TELEFON2
  • dates
    • DATA_URODZENIA
    • CZAS_AKTUALIZACJI
  • mixed
    • DANE_OGOLNE
  • id
    • ID_REKORDU
  • others
    • PRZEPISZ
    • POMIN

To process input columns, you must enable the corresponding module. Method module_std is used to set active modules:

  • address
  • names
  • contact
  • id_numbers

For address module to be started it is necessary to ensure at least one column with the role listed below:

  • DANE_OGOLNE
  • KOD_POCZTOWY
  • MIEJSCOWOSC

Analogously for other modules:

  • names require one of
    • DANE_OGOLNE
    • IMIE
    • NAZWISKO
    • IMIE_I_NAZWISKO
    • NAZWA_PODMIOTU
  • contact
    • DANE_OGOLNE
    • EMAIL1
    • EMAIL2
    • TELEFON1
    • TELEFON2
  • id
    • DANE_OGOLNE
    • PESEL
    • NIP
    • REGON

Check job state

state = dq.job_state('3f14e25e-9f6d-41ff-a4cb-942743a37b73')  # input parameter: job id

print(state)                                                  # 'WAITING' or 'FINISHED'

Cancel job

dq.cancel_job('3f14e25e-9f6d-41ff-a4cb-942743a37b73')  # input parameter: job id

Retrieve job report

report = dq.job_report('3f14e25e-9f6d-41ff-a4cb-942743a37b73')  # input parameter: job id

print(report.quality_issues)
print(report.quality_names)
print(report.results)

Save job results

dq.job_results('3f14e25e-9f6d-41ff-a4cb-942743a37b73', 'output.csv')

Delete job and its results

dq.delete_job('3f14e25e-9f6d-41ff-a4cb-942743a37b73')  # input parameter: job id