[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2014-01-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Bug 6421 depends on bug 6422, which changed state.

Bug 6422 Summary: Extract embedded text from PDF documents for search
https://bugzilla.wikimedia.org/show_bug.cgi?id=6422

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #12 from DrTrigon dr.tri...@surfeu.ch ---
(In reply to comment #11)
> Not a bad idea. We have a bug filed for it somewhere?

What about bug 21061, maybe bug 13370 as well?! Or shall I create a new one?



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #13 from Chad H. innocentkil...@gmail.com ---
(In reply to comment #12)
> (In reply to comment #11)
> > Not a bad idea. We have a bug filed for it somewhere?
>
> What about bug 21061, maybe bug 13370 as well?! Or shall I create a new one?

Those will do great :)



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=21061



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=13370



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #14 from DrTrigon dr.tri...@surfeu.ch ---
Good! I linked them.

Am I wrong or should we now be able to close bug 6422 as well?



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

           What    |Removed                                              |Added
----------------------------------------------------------------------------
           See Also|https://bugzilla.wikimedia.org/show_bug.cgi?id=21061 |



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #10 from DrTrigon dr.tri...@surfeu.ch ---
Nice! Good job - thanks!

What about including metadata into the search (indexing) as well??



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #11 from Chad H. innocentkil...@gmail.com ---
(In reply to comment #10)
> Nice! Good job - thanks!
>
> What about including metadata into the search (indexing) as well??

Not a bad idea. We have a bug filed for it somewhere?



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #8 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 101252 merged by jenkins-bot:
Index and search file text from pdf/djvu files

https://gerrit.wikimedia.org/r/101252



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Nik Everett neverett+bugzi...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|PATCH_TO_REVIEW             |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from Nik Everett neverett+bugzi...@wikimedia.org ---
Merged.  It won't take effect until a full reindex of everything in the file
namespace.  That'll take a few days after the deployment.  Results will start
showing up once each document is indexed.

Also, the file text results are worth .8 of a page text result from a scoring
standpoint.

Finally, this'll work with any files from which MediaWiki is able to extract
text.  If a new file type is plugged in at a later date, those files will have
to be reindexed for the text to be searchable.
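
To illustrate the .8 weighting mentioned above, here is a minimal sketch. The
comment does not say how the weight is combined with the page text score, so
the formula below (scale the file text score, then take the better of the two)
is purely an assumption, and the function and parameter names are hypothetical.
It is not the CirrusSearch implementation.

```python
# Weight taken from the comment above; how it is applied is an assumption.
FILE_TEXT_WEIGHT = 0.8


def weighted_score(page_text_score, file_text_score,
                   file_text_weight=FILE_TEXT_WEIGHT):
    """Combine two relevance scores, discounting matches in the file text.

    Hypothetical helper: a match in the extracted file text counts for
    file_text_weight times as much as the same match in the page text.
    """
    return max(page_text_score, file_text_weight * file_text_score)
```

Under this assumption, a file whose only match is in its text layer ranks
below a page with an equally strong match in its wikitext.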



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #6 from Gerrit Notification Bot gerritad...@wikimedia.org ---
Change 101252 had a related patch set uploaded by Brian Wolff:
Begin indexing file text from pdf/djvu files

https://gerrit.wikimedia.org/r/101252



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Gerrit Notification Bot gerritad...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |PATCH_TO_REVIEW



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2013-12-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #7 from Nemo federicol...@tiscali.it ---
(In reply to comment #6)
> Change 101252 had a related patch set uploaded by Brian Wolff:
> Begin indexing file text from pdf/djvu files
>
> https://gerrit.wikimedia.org/r/101252

Ah, wonderful. That's in CirrusSearch and the core part was already done in
https://gerrit.wikimedia.org/r/#/c/99715/ , but there's nothing more specific
than the DjVu component so I'm not moving this bug.



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-10-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

reza1615 reza.ene...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |reza.ene...@gmail.com
             Blocks|31552                       |41037

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-09-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #5 from DrTrigon dr.tri...@surfeu.ch 2012-09-10 00:08:25 UTC ---
Is it possible to include metadata in the search (indexing) in the MediaWiki
software? As I was informed, the text layer gets extracted by PdfHandler [1]
and is stored in the image's (PDF's) metadata [2] (name=0).

[1] https://www.mediawiki.org/wiki/Extension:PdfHandler
[2]
http://commons.wikimedia.org/w/api.php?action=query&iilimit=500&iiprop=metadata|timestamp&prop=imageinfo&titles=File:Resume-.pdf
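
For reference, a query like [2] can be assembled programmatically. This is a
minimal sketch that only builds the URL and does not contact the server; the
endpoint, parameters and file title are taken from link [2], while the helper
name build_metadata_query is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as given in link [2].
API_ENDPOINT = "http://commons.wikimedia.org/w/api.php"


def build_metadata_query(title, limit=500):
    """Build an imageinfo query URL asking for a file's metadata and timestamp.

    Hypothetical helper; it only assembles the URL.
    """
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata|timestamp",
        "iilimit": limit,
        "titles": title,
    }
    # Keep "|" and ":" unescaped so the URL stays readable.
    return API_ENDPOINT + "?" + urlencode(params, safe="|:")


url = build_metadata_query("File:Resume-.pdf")
```

The extracted text layer would then have to be looked up in the metadata
entries of the response, which this sketch does not attempt.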



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-09-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

DrTrigon dr.tri...@surfeu.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dr.tri...@surfeu.ch
         Depends on|                            |6422



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-09-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #3 from DrTrigon dr.tri...@surfeu.ch 2012-09-04 15:32:08 UTC ---
As far as I can see, this bug is quite old and additionally marked low in
priority. Is this bug going to be fixed at all? In my opinion, solving this bug
is a *must have*.

DrTrigonBot [1] does file-content-based categorization on Commons. For this,
embedded text from PDF (later DjVu too) is extracted and processed. We are
currently debating [2] whether or not to store this text data on a page, in
order to enable the MediaWiki search engine to index and find those contents.

Now the question is: when is this bug scheduled to be fixed? Will it be
fixed at all? If not: as mentioned, DrTrigonBot could dump the files' text
content to a dedicated page in order to enable the MediaWiki search engine to
handle them. This should be considered a workaround only and would not be
needed at all if and when this bug is solved.

[1] http://commons.wikimedia.org/wiki/User:DrTrigonBot
[2]
http://commons.wikimedia.org/wiki/User_talk:DrTrigonBot/JavaScript#PDF_content_extraction



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-09-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

--- Comment #4 from Nemo_bis federicol...@tiscali.it 2012-09-04 15:53:54 UTC ---
I doubt this will be solved any time soon: there's nobody working on this or
related issues, and search is a monster nobody really wants to touch AFAIK, so
I'd suggest you implement whatever workaround you think is worth about 5
more years of usage.
The pages you create should probably be as hidden as possible from users; in
particular, they shouldn't be indexed by external search engines, or they would
e.g. compete with Wikisource (or even archive.org), which doesn't make any
sense.
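
A minimal sketch of what such a hidden dump page could look like. The subpage
prefix and the helper name are hypothetical, and the sketch assumes the
__NOINDEX__ magic word is honoured in the target namespace, which depends on
the wiki's configuration. It only builds the title and wikitext and does not
save anything.

```python
def build_text_dump_page(file_title, extracted_text,
                         prefix="User:DrTrigonBot/TextDump/"):
    """Return (page_title, wikitext) for a text-dump page.

    Hypothetical helper: the prefix is an illustrative choice, not an
    existing convention.
    """
    # Drop the "File:" namespace prefix so it does not end up in the subpage name.
    if file_title.startswith("File:"):
        name = file_title.split(":", 1)[1]
    else:
        name = file_title
    page_title = prefix + name
    # __NOINDEX__ asks external search engines not to index the page.
    wikitext = "__NOINDEX__\n" + extracted_text.strip() + "\n"
    return page_title, wikitext
```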



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-04-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Doug wikipediad...@googlemail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |31552



[Bug 6421] Extract embedded text from DjVu and PDF documents for search

2012-04-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=6421

Doug wikipediad...@googlemail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wikipediad...@googlemail.com
             Blocks|                            |35925
            Summary|Extract embedded text from  |Extract embedded text from
                   |DjVu documents for search   |DjVu and PDF documents for
                   |                            |search

--- Comment #2 from Doug wikipediad...@googlemail.com 2012-04-12 22:16:27 UTC ---
I'm pretty sure he means that search results should include hits from within
the text layer of the file scans proper.  So if the scans include the text
"Foobar", a search for "bar" would return the text from the scan's text layer as
a result, even though the text layer had not been extracted and placed on a wiki
page.  There's no reason that this should be restricted to DjVus (though there
was at the time the bug was filed).
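
The "Foobar"/"bar" example can be sketched as follows. This is only an
illustration of the requested behaviour, assuming case-insensitive substring
matching; a real search engine tokenizes and ranks text differently. The
function name and the sample file titles are hypothetical.

```python
def search_text_layers(text_layers, query):
    """Return the titles of files whose text layer contains the query.

    text_layers maps a file title to the text extracted from its scan.
    Matching is a plain case-insensitive substring test (an assumption).
    """
    needle = query.lower()
    return [title for title, text in text_layers.items()
            if needle in text.lower()]


layers = {
    "File:Scan.djvu": "Foobar",
    "File:Other.pdf": "nothing relevant here",
}
hits = search_text_layers(layers, "bar")
```

Here the search finds File:Scan.djvu even though its text layer was never
copied onto a wiki page.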
