18 June 2015 @ 06:34 pm
Converting doc/docx/pdf/text to each other  
Lots of applications need to load and convert document files of different formats into other formats or into text.
You would think there would be a good solution for this.
Unfortunately, that's not the case.
Existing solutions are either desktop-only, buggy, or extremely expensive (~$10K/year).

I thought I had found a solution - the DevExpress Document Server library for $599.99.

Unfortunately, after running for a couple of weeks, it crashed my service with a StackOverflowException:
----
https://www.devexpress.com/Support/Center/Question/Details/T257097
To my regret, there is no simple workaround to avoid this exception with your document. Regarding the time frame for fixing this issue, it is difficult to provide any estimate in such cases.
----

So now I need to find a way to prevent my service from dying when some random document is fed into it.

Sigh.
 
 
 
cranequinier on June 19th, 2015 05:22 am (UTC)
Converting DOC in a nice way is basically impossible without a VM running Windows 2000 - it's an old COM storage format.
Dennis Gorelik (dennisgorelik) on June 19th, 2015 08:00 am (UTC)
What does "not nice way" for converting DOC mean?

Are memory leaks and occasional fatal crashes pretty much guaranteed?
cranequinier on June 19th, 2015 03:31 pm (UTC)
> What does "not nice way" for converting DOC mean?

It means skewed tables and garbage in some places instead of text.

> Are memory leaks and occasional fatal crashes pretty much guaranteed?

For a .NET library on a web server? Of course.
Dennis Gorelik (dennisgorelik) on June 19th, 2015 07:21 pm (UTC)
> It means skewed tables and garbage in some places instead of text.

That's not a serious problem.
The crashes that kill the whole process - that's what concerns me.

> For a .NET library on a web server? Of course.

Do you mean that any .NET library on a web server crashes occasionally?
Or are such crashes specific to converting DOC files (due to old COM storage calls)?

I'm not sure about the web server, because IIS automatically recovers from problems (so we might not have noticed), but our Windows service has not exited due to a crash for several years (and when it did, it was our own silly coding mistake).
sagarasousuke on June 19th, 2015 08:53 am (UTC)
A cron job that checks every minute and restarts the service if it crashed? A quick-and-dirty fix when you have a queue to process (i.e., mark the "supposedly bad" document as "check-manually" and skip it).
Dennis Gorelik (dennisgorelik) on June 19th, 2015 11:20 am (UTC)
We implemented a similar workaround:
1) Auto-restart our Windows service if it crashes.
2) Remember the hash of the resume that caused the crash and do not try to convert it again.

However, that is only OK as an ugly patch, because our Windows service runs many other processes in parallel with the document converter queue.
All of these processes get terminated midway.

Another problem is that the web site will also crash.
Fortunately, IIS automatically recovers the crashed thread, but it still makes me think that hard crashes like these can add instability to our system.
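
For illustration, here is a minimal Python sketch of the "remember the hash of the crashing resume" part of this workaround. It is not the actual service code: the `CrashSkipList` class, the JSON state file, and the `in_progress`/`bad` labels are all assumptions made up for this example. The idea is to record a document's hash before conversion starts and clear it after success, so a hash that is still marked as in progress at startup most likely belongs to the document that killed the process.

```python
import hashlib
import json
from pathlib import Path


class CrashSkipList:
    """Hypothetical helper: remembers documents whose conversion never finished."""

    def __init__(self, state_file: Path):
        self.state_file = state_file
        # Maps SHA-256 of the document bytes to "in_progress" or "bad".
        saved = json.loads(state_file.read_text()) if state_file.exists() else {}
        # Anything still "in_progress" at startup was being converted when the
        # process died, so treat it as a document to skip from now on.
        self.state = {digest: "bad" for digest in saved}
        self._save()

    def _save(self) -> None:
        # Assumption: a plain (non-atomic) write is good enough for a sketch.
        self.state_file.write_text(json.dumps(self.state))

    @staticmethod
    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def should_skip(self, data: bytes) -> bool:
        return self.state.get(self.digest(data)) == "bad"

    def begin(self, data: bytes) -> None:
        # Call right before handing the document to the converter.
        self.state[self.digest(data)] = "in_progress"
        self._save()

    def finish(self, data: bytes) -> None:
        # Call after the converter returned normally.
        self.state.pop(self.digest(data), None)
        self._save()
```

The service would call `should_skip` before each conversion, then `begin`, then `finish`. This only avoids repeating the same crash; it does nothing for the other work that is terminated when the process dies.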
sagarasousuke on June 19th, 2015 11:31 am (UTC)
Kind of a "reliable processing system made of unreliable elements", where task state is kept separate from the processing done by (unstable) agents/services.
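
One way to read this idea is to run each conversion in a separate child process, so that a hard crash kills only the child while the parent keeps the task state. Below is a rough Python sketch under that assumption. The `CHILD_CODE` stand-in converter, the `convert_isolated` and `process_queue` names, and the `done` status are hypothetical; the child just pretends to crash on a file named `bad.doc`.

```python
import subprocess
import sys

# Hypothetical stand-in for the real converter. The child gets the document
# path as its first argument and exits abnormally for "bad.doc" to imitate
# a hard crash such as a stack overflow.
CHILD_CODE = """
import os, sys
path = sys.argv[1]
if path.endswith("bad.doc"):
    os._exit(3)
print("text of " + path)
"""


def convert_isolated(path: str, timeout_s: float = 60.0) -> tuple[str, str]:
    """Run one conversion in a child process and return (status, text)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung converter is handled the same way as a crashed one.
        return "check-manually", ""
    if result.returncode != 0:
        # The child died, but the parent process is unaffected.
        return "check-manually", ""
    return "done", result.stdout.strip()


def process_queue(paths: list[str]) -> dict[str, tuple[str, str]]:
    # Task state lives in the parent, separate from the unstable worker.
    return {path: convert_isolated(path) for path in paths}
```

With this layout, `process_queue(["a.doc", "bad.doc"])` marks the second document as "check-manually" and still converts the first one. The cost is one process start per document.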
serjiojitser on June 28th, 2015 02:14 pm (UTC)
There are two reasons:

1. PDF is a fairly closed format owned by Adobe.
2. PDF is quite complex (it is a language).

It's roughly the same with Microsoft Word: it's complex and its owners are jerks. Neither of them wants to share; they want to become format monopolists.

So don't expect full-fledged converters.

Look for small converter DLLs from Indian programmers. They work at 80%, and that's the maximum.
Dennis Gorelik (dennisgorelik) on June 28th, 2015 09:11 pm (UTC)
Do you mean using separate DLLs for converting pdf->text and doc->docx?

What does "work at 80%" mean?
serjiojitser on June 30th, 2015 03:55 am (UTC)
It means that some sections with tricky code don't get converted, so "blank spots" may remain. There can also be distortions (but that is more of a vector graphics issue; with ordinary text it's rare).