


How to turn a pdf into a text searchable pdf?















12















I have a number of scanned documents in PDF format and I want to be able to search them. How can I do that?

Essentially I have to OCR the PDF and then blend the extracted text back into a new PDF. I have unsuccessfully tried a number of different solutions (including the ones found in Adding OCR info to a PDF):

  1. pdfocr (which gives me this issue: https://github.com/gkovacs/pdfocr/issues/7)
  2. pdfsandwich (the software center says it is a poor-quality package and that I should not install it)
  3. OCRfeeder (in the software center) exports to odt nicely, but does not respond when exporting to PDF.
  4. gscan2pdf exports an all-black (but searchable) image, as reported in this discussion.
  5. I don't think PDF-XChange Viewer can handle doing OCR on the fly on files over 500 pages.

Is there a software package I am unaware of? Or a script that does this?

















software-recommendation pdf ocr














edited Apr 13 '17 at 12:24 by Community
asked May 29 '14 at 9:37 by don.joey







  • 3





    I haven't tried it out myself, yet, but I've seen this project get recommended in the past.

    – Glutanimate
    May 29 '14 at 21:22






















4 Answers





































4














@don.joey answered with the OCRmyPDF script. However, from Ubuntu 16.10 onwards it can be installed directly:



sudo apt install ocrmypdf


Then you have to install the tesseract languages you need.



To list which languages are already in your system, type:



tesseract --list-langs


If one you need is missing, install it. For instance, for Spanish:



sudo apt install tesseract-ocr-spa


Now you can produce a searchable PDF (whose quality will vary, depending on the scanned document) with the following command:



ocrmypdf -l 'spa' old.pdf new.pdf


You can, of course, check its man page for some additional options.
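
The language check and the OCR call can be combined so that a missing language pack is caught before a long OCR run starts. This is only a minimal sketch, assuming tesseract and ocrmypdf are installed. The helper names have_lang and ocr_with_lang are invented here, and the suggested package name simply follows the tesseract-ocr-spa pattern above, which may not hold for every language code.

```shell
#!/bin/sh
# Sketch: refuse to start OCR when the requested tesseract language is missing.
# have_lang and ocr_with_lang are illustrative helper names.

# Succeeds if `tesseract --list-langs` prints the given code on a line of its own.
# Some tesseract versions print the list to stderr, hence 2>&1.
have_lang() {
    tesseract --list-langs 2>&1 | grep -xF -- "$1" >/dev/null
}

# Usage: ocr_with_lang LANG INPUT.pdf OUTPUT.pdf
ocr_with_lang() {
    lang=$1
    if ! have_lang "$lang"; then
        # Package name guessed from the tesseract-ocr-spa pattern above.
        echo "language '$lang' not installed; try: sudo apt install tesseract-ocr-$lang" >&2
        return 1
    fi
    ocrmypdf -l "$lang" "$2" "$3"
}

# Example (uncomment to run): ocr_with_lang spa old.pdf new.pdf
```

Failing early like this is mostly a convenience for large scans, where a wrong language setting would otherwise only show up after the whole run.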





























  • Have my upvote sir!

    – don.joey
    Feb 13 '17 at 8:36


















0














OCRfeeder hits a bug in the reportlab library, in the file

/usr/lib/python2.7/dist-packages/reportlab/pdfgen/textobject.py

Line 436 should read:

lines = asUnicode(stuff).strip().split('\n')
# bug here, was:
# lines = '\n'.split(asUnicode(stuff).strip())

I changed this and it worked for me.







































    11






















    Ubuntu <= 16.04



    Following the comment of Glutanimate I have found a working solution. It is the OCRmyPDF script.



    git clone https://github.com/jbarlow83/OCRmyPDF
    cd OCRmyPDF
    sh ./OCRmyPDF.sh -h # to see the usage


    If you get a message saying you should install GNU parallel, you can do so (following https://askubuntu.com/a/298598/115155) with the commands below; the second line is optional and depends on your flavor and version:



    sudo apt-get install parallel
    sudo rm /etc/parallel/config


    Finally you can OCR your pdf with the command:



    sh ./OCRmyPDF.sh input.pdf output.pdf # change input and output to the files you want


    If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter pdf. You can shorten a pdf as follows:



    pdftk A=input.pdf cat A1-5 output output.pdf


    Ubuntu >= 16.10

    As of Ubuntu 16.10, OCRmyPDF is available through apt. Just run:



    sudo apt install ocrmypdf
    ocrmypdf -h # to see the usage


    Finally you can OCR your pdf with the command:



    ocrmypdf input.pdf output.pdf # change input and output to the files you want


    If it seems the command is unresponsive, you can increase the verbosity using the -v flag (which can be used incrementally as -vv or -vvv). It might be best to test the results first on a shorter pdf. You can shorten a pdf as follows:



    pdftk A=input.pdf cat A1-5 output output.pdf
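
The trial run suggested above can be wrapped in a small helper. This is only a sketch: the function name ocr_trial and the sample file names are made up for illustration, and it assumes pdftk and ocrmypdf are both on your PATH. It cuts out the first pages with the same pdftk call and runs ocrmypdf on that sample only.

```shell
#!/bin/sh
# Sketch: OCR only the first few pages of a PDF as a quick quality check.
# ocr_trial and the sample file names are illustrative, not part of ocrmypdf.
ocr_trial() {
    input=$1
    pages=${2:-5}                      # how many leading pages to try (default 5)
    work=$(mktemp -d) || return 1      # scratch directory for the sample files

    # Same pdftk call as above, with the page range made configurable.
    # pdftk fails if the document has fewer pages than requested.
    pdftk A="$input" cat "A1-$pages" output "$work/sample.pdf" || return 1

    # OCR the short sample instead of the full document.
    ocrmypdf "$work/sample.pdf" "$work/sample_ocr.pdf" || return 1

    echo "trial result: $work/sample_ocr.pdf"
}

# Example (uncomment to run): ocr_trial input.pdf 3
```

If the text in the sample turns out to be selectable and searchable in your PDF viewer, the full document can then be processed with the plain ocrmypdf command.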


    If you have any questions, have a look at the new GitHub repo.















    edited 2 hours ago
    answered May 30 '14 at 8:20 by don.joey












    • Would you accept your answer, to resolve it?(so that it does not come in the unanswered list)

      – Registered User
      Jun 19 '14 at 13:37











    • Just sudo -H pip install git+https://github.com/jbarlow83/OCRmyPDF for Ubuntu 16.04

      – Martin Thoma
      Aug 14 '17 at 20:39







    • 1





      For Ubuntu 16.10 and later, you can just do sudo apt install ocrmypdf.

      – endolith
      Feb 26 '18 at 16:46






































    4






















    pdfsandwich performs exactly this job. I wasn't aware that there is a package provided in the software center, but I'm providing Ubuntu deb packages for it on the project website (see http://www.tobias-elze.de/pdfsandwich/ for details), including the currently most recent version (0.1.2), which is unlikely to be in any software center yet.



    If you have a scanned file scanned_file.pdf, simply call



    pdfsandwich scanned_file.pdf


    which generates the file scanned_file_ocr.pdf with the recognized text added to the scanned pages.
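
For a whole set of scans, the call above can be repeated in a loop. A minimal sketch, assuming pdfsandwich is on your PATH and writes NAME_ocr.pdf next to each input as described above; the function name sandwich_all is made up for this example.

```shell
#!/bin/sh
# Sketch: run pdfsandwich on several PDFs, one after another.
# sandwich_all is an illustrative name, not part of pdfsandwich.
sandwich_all() {
    for pdf in "$@"; do
        case $pdf in
            *_ocr.pdf) continue ;;          # result of an earlier run, skip it
        esac
        if [ -e "${pdf%.pdf}_ocr.pdf" ]; then
            echo "skipping $pdf (already processed)"
            continue
        fi
        pdfsandwich "$pdf" || echo "failed: $pdf" >&2
    done
}

# Example (uncomment to run): sandwich_all ./*.pdf
```

Skipping files that already have an _ocr.pdf companion makes it safe to rerun the loop after an interruption.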



    Compared to most existing solutions, it autodetects the tesseract version installed and adapts its behavior accordingly. In addition, it performs preprocessing of the scanned images prior to the OCR process, such as de-skewing or removal of dark edges etc., which can considerably improve optical character recognition.



    DISCLAIMER: I'm the developer of pdfsandwich and therefore heavily biased.















    edited Oct 10 '15 at 12:44 by Nephente
    answered Jul 24 '14 at 14:29 by Tobias Elze












    • It sounds great, but why does pdfsandwich version 0.1.4 installed using apt-get convert each character into a black rectangle for me on Ubuntu 16.04?

      – Valentas
      Dec 16 '16 at 16:04






    • 1





      That's hard to answer without further details. First of all, I recommend to use a more recent version of the tool. The current version is 0.1.6. You can find deb packages for Ubuntu on the website. Second, if that does not help, you may want to use the option -verbose to get further details and use these details to file a bug report.

      – Tobias Elze
      Jan 17 '17 at 1:39






























    4














    @don.joey's answer uses the ocrmypdf script. From Ubuntu 16.10 onwards, however, it can be installed directly:

    sudo apt install ocrmypdf

    Then install the tesseract languages you need.

    To list the languages already on your system, type:

    tesseract --list-langs

    If the one you need is missing, install it. For instance, for Spanish:

    sudo apt install tesseract-ocr-spa

    Now you can produce a searchable PDF (whose quality will vary depending on the scanned document) with the following command:

    ocrmypdf -l 'spa' old.pdf new.pdf

    You can, of course, check its man page for additional options.
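
The language check described above can also be scripted. The following is a minimal Python sketch, not part of ocrmypdf or tesseract: the helper names parse_langs, missing_langs and installed_listing are made up for illustration, and it assumes the listing consists of a "List of available languages" header followed by one language code per line.

```python
import subprocess


def parse_langs(listing):
    """Return the language codes found in the text of `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Assumption: the header line starts with "List of"; every other line is a code.
    return set(line for line in lines if not line.startswith("List of"))


def missing_langs(wanted, listing):
    """Return the wanted codes that do not appear in the listing, sorted."""
    return sorted(set(wanted) - parse_langs(listing))


def installed_listing():
    """Run tesseract and return its listing (stderr is merged into stdout,
    since some versions may print the list there)."""
    proc = subprocess.run(["tesseract", "--list-langs"],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    return proc.stdout


# Hypothetical sample listing, used here so the sketch runs without tesseract.
sample = "List of available languages (2):\neng\nosd\n"
print(missing_langs(["eng", "spa"], sample))
```

To check the real system, pass installed_listing() instead of sample. For each missing code, the package name probably follows the tesseract-ocr-<code> pattern of the Spanish example, but that is an assumption for any code other than spa.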
















    answered Feb 11 '17 at 21:05 by Ludenticus












    • Have my upvote sir!

      – don.joey
      Feb 13 '17 at 8:36































    OCRfeeder is affected by a bug in the reportlab library it uses, in the file

    /usr/lib/python2.7/dist-packages/reportlab/pdfgen/textobject.py

    Line 436 should read:

    lines = asUnicode(stuff).strip().split('\n')
    # bug here, was:
    # lines = '\n'.split(asUnicode(stuff).strip())

    After I changed this, it worked for me.
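
To see why the order of the operands matters, here is a small stand-alone sketch in plain Python. It is not reportlab code, the sample string is made up, and it assumes the separator in the line above is the newline character '\n'. str.split is called on the string to be split and takes the separator as its argument, so swapping the two splits the one-character newline string instead of the text.

```python
# Stand-alone illustration; `text` is a made-up sample, not reportlab data.
text = "first line\nsecond line\n"

# Correct order: call split() on the text and pass the separator.
fixed = text.strip().split("\n")

# Swapped order: this splits the separator string itself, using the text
# as the separator, so the text is never broken into lines.
buggy = "\n".split(text.strip())

print(fixed)  # ['first line', 'second line']
print(buggy)  # ['\n']
```

With the swapped order the result is always a single-element list holding only the newline, whatever the input text is, so the original lines are lost.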
















    answered Jan 9 '17 at 22:24 by AndreR


























