How to Index Pure Text into Separate Fields?
Hi, I am using XPath to index different parts of HTML pages into different fields. Now I also have some pure text documents that contain no HTML, so I can't use XPath on them. How do I index this pure text into different fields of the index? How do I make Nutch/Solr understand that different parts of the text belong to different fields? Maybe I can use the existing content of the fields in my index? Thanks.
Re: How to Index Pure Text into Separate Fields?
Can you provide a few more details? You mention XPath, which leads me to believe that you are using DIH; is that true? How are you getting your documents to index? From parts of a filesystem? I ask because it's possible to do many things. If you're using DIH against a filesystem, you could use two fileDataSources: one that works only on files with a particular extension (xml, say) and another that processes .txt files. That said, if you're trying to index just the text of a Word document, you have to parse it quite differently than a plain text file; take a look at Tika. All of which may not help you at all, because I'm guessing... So I think a more complete problem statement would help us help you.
Best,
Erick
(In reply to Savannah Beckett savannah_becket...@yahoo.com, Wed, Sep 29, 2010 at 3:56 PM.)
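To make the "one source per file type" idea concrete, here is a minimal Python sketch of routing files to different indexing paths by extension. It is not DIH configuration and not anything from Solr itself: the route names (`xpath`, `plain_text`, `tika`) and the extension table are assumptions chosen purely for illustration.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an indexing route.
# The route names are illustrative labels, not Solr or DIH identifiers.
ROUTES = {
    ".html": "xpath",
    ".htm": "xpath",
    ".xml": "xpath",
    ".txt": "plain_text",
    ".doc": "tika",
}


def route_for(path):
    """Return the route name for a file, or None if the extension is unknown."""
    return ROUTES.get(Path(path).suffix.lower())


def group_by_route(paths):
    """Group paths by route, skipping files with unrecognised extensions."""
    groups = {}
    for p in paths:
        route = route_for(p)
        if route is not None:
            groups.setdefault(route, []).append(str(p))
    return groups
```

Each group can then be handed to whatever does the real work for that type, for example the existing XPath setup for the HTML files and a separate text parser for the .txt files.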
Re: How to Index Pure Text into Separate Fields?
No, I am using XPath for HTML; that is not the question. I am indexing pure text in addition to the HTML that I was already indexing: pure text like a TXT file or a Microsoft Word doc. Since there is no XPath for TXT, how do I index a TXT file into different fields in my index, the way I use XPath to index HTML into different fields? My question refers to pure text like .txt files and Microsoft Word documents, not HTML. I am completely fine with HTML. Thanks.
(In reply to Erick Erickson erickerick...@gmail.com, sent to solr-user@lucene.apache.org, Wed, September 29, 2010 2:59:26 PM.)
Re: How to Index Pure Text into Separate Fields?
Simple .txt files and MS Office .doc files are very, very different beasts. You can handle simple .txt files with a few more lines in your DataImportHandler script. With DOC files it is easiest to use the extracting request handler, */extract. This is on the wiki. If you want to do this inside the DataImportHandler, you need to use 3.x or the trunk, and it has bugs.
(In reply to Savannah Beckett savannah_becket...@yahoo.com, Wed, Sep 29, 2010 at 3:55 PM.)
-- Lance Norskog goks...@gmail.com
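Plain text has no markup to select against, so splitting it into fields means imposing some convention on the text itself. As a rough sketch only, the Python below assumes a hypothetical layout that the thread never specifies: the first line is a title, any following `Key: value` lines are extra fields, and everything after that is the body. The function name and the field names `title` and `body` are made up for the example; a real schema would need its own names and rules.

```python
import re

# Matches header-style lines such as "Author: Jane Doe".
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):\s*(.+)$")


def split_into_fields(text):
    """Split plain text into a dict of fields under the assumed layout.

    Assumed layout (hypothetical): line 1 is the title, the lines right
    after it that look like "Key: value" become fields, and the rest
    is the body.
    """
    lines = text.strip().splitlines()
    doc = {"title": lines[0].strip() if lines else ""}
    i = 1
    # Consume consecutive "Key: value" lines directly after the title.
    while i < len(lines):
        match = HEADER_RE.match(lines[i].strip())
        if not match:
            break
        doc[match.group(1).lower()] = match.group(2).strip()
        i += 1
    # Everything that remains is treated as the body.
    doc["body"] = "\n".join(lines[i:]).strip()
    return doc
```

The resulting dict maps field names to values and could be posted to Solr as one document. Note the weakness of this convention: a body that happens to start with a line like `Note: ...` right after the title would be read as a field, so the rules need to match how the actual files are written.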