Discussion:
Newbie question
Eli
2023-03-10 21:04:24 UTC
Permalink
Hi,

This question has probably been asked before, but I couldn't find it, so here
it is.

I will soon be setting up a text-only news server for public access, but what
is the minimum storage capacity my server needs for all non-binary newsgroups?

Thanks, Eli.
Gary Goog
2023-03-10 22:05:00 UTC
Permalink
Post by Eli
I will soon be setting up a text-only news server for public access, but what
is the minimum storage capacity my server needs for all non-binary newsgroups?
Anything you can afford. Hard disk prices keep getting cheaper. Are you
going to require registration, or will it be an open server like Paganini,
aioe, and mixmin? Open servers are quite popular and you'll get more
users that way.

Make sure you don't filter or censor anything or block anybody on it
otherwise you will become a hate figure and target for hackers.
Eli
2023-03-10 22:18:25 UTC
Permalink
Post by Gary Goog
Post by Eli
I will soon be setting up a text-only news server for public access, but what
is the minimum storage capacity my server needs for all non-binary newsgroups?
Anything you can afford. Hard disk prices keep getting cheaper. Are you
going to require registration, or will it be an open server like Paganini,
aioe, and mixmin? Open servers are quite popular and you'll get more
users that way.
SSD (NVMe) disks are not that cheap, but are they necessary for a news server
or are HDD disks fine?

I was thinking to start with 2x 1.92 TB SSD or is that not enough for all
non-binary groups?

I really have no idea how much data these newsgroups take up.
Grant Taylor
2023-03-10 22:27:57 UTC
Permalink
Post by Eli
SSD (NVMe) disks are not that cheap, but are they necessary for a
news server or are HDD disks fine?
SSD / NVMe drives would be my default as they are faster, quieter, and
consume less power.

That being said, spinning rust is perfectly fine.
Post by Eli
I was thinking to start with 2x 1.92 TB SSD or is that not enough
for all non-binary groups?
LOL That's WAY MORE than you need.

I have my /transit/ news server on a 10 GB file system. It purges
things that are older than 30 days.

My /private/ news server has 147 GB of compressed articles (ZFS on-the-fly
compression) going back to November 2018.
Post by Eli
I really have no idea how much data these newsgroups take up.
It really depends on what groups / message size / retention period you keep.

Please feel free to reach out to me when you're ready to peer. I'm
happy to peer with you. I don't mind being the first peer and helping
you get your newsmaster feet wet. (Some people don't want to be the
first peer.)
--
Grant. . . .
unix || die
Marco Moock
2023-03-11 09:55:09 UTC
Permalink
Post by Eli
SSD (NVMe) disks are not that cheap, but are they necessary for a
news server or are HDD disks fine?
20 years ago Usenet was much more popular and SSDs didn't exist.
Neodome Admin
2023-03-16 08:24:06 UTC
Permalink
Post by Eli
I was thinking to start with 2x 1.92 TB SSD or is that not enough for all
non-binary groups?
It will be more than enough.

Just don't ask your peers for full-size articles even if they are posted
to non-binary groups. You literally don't need them even if they are
posted to text groups. There are no meaningful text articles bigger than
64 KB. Actually, the maximum size is probably 32 KB or less.
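If you run INN, a minimal sketch of enforcing such a cap on incoming
articles is the maxartsize parameter in inn.conf (the value below is
illustrative; INN's stock default is considerably larger):

   # inn.conf: reject incoming articles larger than 64 KB
   maxartsize: 65536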

Good luck.
Tom Furie
2023-03-16 08:27:08 UTC
Permalink
There are no meaningful text articles bigger than 64 KB. Actually,
the maximum size is probably 32 KB or less.
There are several regularly posted FAQs, etc., which are larger than
that.

Cheers,
Tom
Neodome Admin
2023-03-16 08:51:20 UTC
Permalink
Post by Tom Furie
There are no meaningful text articles bigger than 64 KB. Actually,
the maximum size is probably 32 KB or less.
There are several regularly posted FAQs, etc., which are larger than
that.
Cheers,
Tom
I said "meaningful", Tom :-)

Seriously, it's not 1995, 2001, or even 2008. No one reads those
FAQs. At least on Usenet. We might pretend all we want, but that's just
the way things are. Those FAQs are nothing more than regular spam in
most of the newsgroups where they are posted. How many times have you
visited a group and found nothing except Google Groups drug scams and
those FAQs? Probably a lot of times, huh?
DV
2023-03-16 08:54:46 UTC
Permalink
Post by Neodome Admin
Seriously, it's not 1995, 2001, or even 2008. No one reads those
FAQs.
I do.
--
Denis

News servers and web gateways: <http://usenet-fr.yakakwatik.org>
Newsreaders: <http://usenet-fr.yakakwatik.org/lecteurs-de-news.html>
Neodome Admin
2023-03-16 09:25:42 UTC
Permalink
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.

I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
Tom Furie
2023-03-16 09:35:57 UTC
Permalink
Post by Neodome Admin
I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
That's your server, run it however you like. The person you suggested
should limit article sizes to 64 or 32 KB might like to know that there
are larger articles which may be considered of interest; that's their
call to make.

Cheers,
Tom
Neodome Admin
2023-03-16 10:08:13 UTC
Permalink
Post by Tom Furie
Post by Neodome Admin
I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
That's your server, run it however you like. The person you suggested
should limit article sizes to 64 or 32 KB might like to know that there
are larger articles which may be considered of interest; that's their
call to make.
Absolutely. You are correct. However, that person was asking for advice
on running a text-only Usenet server, and that's exactly what I
provided. In my opinion, there are no larger articles which may be
considered of interest, if text Usenet is the interest.
DV
2023-03-16 09:39:32 UTC
Permalink
Post by Neodome Admin
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.
I say it again: I do. You should stop repeating that *no one* needs
them, unless you think I don't exist.
--
Denis

News servers and web gateways: <http://usenet-fr.yakakwatik.org>
Newsreaders: <http://usenet-fr.yakakwatik.org/lecteurs-de-news.html>
Neodome Admin
2023-03-16 10:19:20 UTC
Permalink
Post by DV
Post by Neodome Admin
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.
I say it again: I do. You should stop repeating that *no one* needs
them, unless you think I don't exist.
I don't think you don't exist. I think you belong to binary Usenet, and
you're free to read and post anything you want as long as all parties
involved agree on that.

No, seriously, I have no problems with 700+KB posts. If you want, I can
set up a script posting any 700+KB FAQ you want, to any newsgroup, using
your name, as often as you want, and even more often. What do you say?
Grant Taylor
2023-03-16 18:21:52 UTC
Permalink
I think you belong to binary Usenet, and you're free to read and post
anything you want as long as all parties involved agree on that.
Wait a minute.

We're talking about a /text/ post consisting of entirely printable ASCII
meant to be read by a human. That's very much so /text/. It's not
binary encoded in text.

You are free to have your own opinion of the value of such FAQ posts.
But your lack of value for them doesn't make them any less of a text post.

You are free to run your server however you want. But I think others
should think long and hard before following the advice that you're posting.
No, seriously, I have no problems with 700+KB posts. If you want, I can
set up a script posting any 700+KB FAQ you want, to any newsgroup, using
your name, as often as you want, and even more often. What do you say?
Stop it.

I know that you know that would be a form of abuse.

I expect better than that from a fellow newsmaster.
--
Grant. . . .
unix || die
Neodome Admin
2023-03-22 06:19:30 UTC
Permalink
Post by Grant Taylor
I think you belong to binary Usenet, and you're free to read and
post anything you want as long as all parties involved agree on
that.
Wait a minute.
We're talking about a /text/ post consisting of entirely printable
ASCII meant to be read by a human. That's very much so /text/. It's
not binary encoded in text.
They are not posts *created* by humans, and this is my problem with
them. Of course, if we try to be completely logical about this, there
can be posts created by humans with binary files attached, etc., and no
one cares about those.
Post by Grant Taylor
No, seriously, I have no problems with 700+KB posts. If you want, I can
set up a script posting any 700+KB FAQ you want, to any newsgroup, using
your name, as often as you want, and even more often. What do you say?
Stop it.
I know that you know that would be a form of abuse.
You are correct, Grant. It was sarcasm.
Grant Taylor
2023-03-22 17:57:55 UTC
Permalink
Post by Neodome Admin
They are not posts *created* by humans, and this is my problem with
them.
Okay. I think that's a reasonable differentiation. I also think that
it's one that's harder to programmatically determine.
Post by Neodome Admin
Of course, if we try to be completely logical about this, there
can be posts created by humans with binary files attached, etc.,
and no one cares about those.
ACK

Hence the programmatically comment. ;-)
Post by Neodome Admin
You are correct, Grant. It was sarcasm.
Ah. That obviously didn't come across. -- I've been dealing with
recruiters whose intelligence / willingness to exert effort seems to
range from that of a slug to that of a track superstar. My calibration
is out of whack at the moment.
--
Grant. . . .
unix || die
news.usenet.ovh Admin
2023-03-22 23:55:58 UTC
Permalink
Post by DV
Post by Neodome Admin
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.
I say it again: i do. You should stop repeating that *no one* needs
them, unless you think I don't exist.
Sure, you exist.
But who wants an obsolete FAQ? To do what?

--
List of free servers that distribute the "fr" hierarchy

http://usenet.ovh/?article=faq_serveur_gratuit
Grant Taylor
2023-03-23 00:05:41 UTC
Permalink
Post by news.usenet.ovh Admin
Sure, you exist.
But who wants an obsolete FAQ? To do what?
I believe you already have the answer to your question.

I also (re)read things from the days of yore. Some things are worth
reading. Some are not. You have to read them to find out which they
are. }:-)
--
Grant. . . .
unix || die
llp
2023-03-16 22:20:11 UTC
Permalink
Post by Neodome Admin
I do.
Like I said, we can pretend all we want, but old Usenet is gone. No one
cares. I'm sorry guys. No one needs those FAQs.
I agree.
And a lot of them are outdated.
Post by Neodome Admin
I look at the server stats and even though there is no open posting
anymore I still see hundreds of people reading via my servers. And after
all these years not a single one of them ever complained that they can't
read some article. And it's not like I was running the server for a year
or two.
;-)
--
New usenet server for fr.* and news.*
news.usenet.ovh
Feel free to contact us: ***@usenet.ovh
😉 Good Guy 😉
2023-03-16 23:00:00 UTC
Permalink
The main message is in the html section of this post but you are not able to read it because you are using an unapproved news-client. Please try these links to amuse yourself:

[image links omitted]
--
https://contact.mainsite.tk
Jesse Rehmer
2023-03-16 09:14:47 UTC
Permalink
Post by Neodome Admin
I said "meaningful", Tom :-)
Seriously, it's not 1995, 2001, or even 2008. No one reads those
FAQs. At least on Usenet. We might pretend all we want, but that's just
the way things are. Those FAQs are nothing more than regular spam in
most of the newsgroups where they are posted. How many times have you
visited a group and found nothing except Google Groups drug scams and
those FAQs? Probably a lot of times, huh?
I've found many of them useful over the years. If you're not being fed them,
it seems difficult to judge their value.

For people getting into retro computing (Atari, Amiga, etc.), some of those
700+KB FAQ articles are gold.
Neodome Admin
2023-03-16 09:43:56 UTC
Permalink
Post by Jesse Rehmer
For people getting into retro computing (Atari, Amiga, etc.), some of those
700+KB FAQ articles are gold.
I don't doubt that. I doubt that regular posting of a 700+KB FAQ is doing
any good. I doubt that anything in those FAQs is more useful than
information that can be found with Google or DuckDuckGo. We're not
living in the era of AltaVista, after all. And if there is some kind of
gem hidden there, one simply doesn't need to post it to a newsgroup
regularly with 700+KB of irrelevant text. Plus, I'm pretty sure that if
there are any questions, one can just ask a question in a retro-computing
group and expect an answer... unless that group is dead, of course.
Grant Taylor
2023-03-16 18:30:16 UTC
Permalink
Post by Neodome Admin
I don't doubt that.
So you agree that the content of the articles does have some value to
some people.
Post by Neodome Admin
I doubt that regular posting of a 700+KB FAQ is doing any good.
What's your primary objection? The frequency or the size of the posts?
Post by Neodome Admin
I doubt that anything in those FAQs is more useful than information
that can be found with Google or DuckDuckGo. We're not living in the
era of AltaVista, after all. And if there is some kind of gem hidden
there, one simply doesn't need to post it to a newsgroup regularly with
700+KB of irrelevant text.
I think that there is some value in having some unrequested information
put in front of you.

I've seen many things that I didn't know that I wanted to know put in
front of me.

I've also been mildly interested in something and seen something new (to
me) done with it that really piques my interest and causes me to actively
investigate it.

I believe there is some value in things being put in front of me for my
perusal.
Post by Neodome Admin
Plus, I'm pretty sure that if there are any questions, one can just
ask a question in retro-computing group and expect an answer... unless
that group is dead, of course.
It's really hard to ask a question about something if you don't know
that said something exists.

I don't mind quarterly or even monthly posting of FAQs. I do have an
objection to super large FAQs. -- I think I have my server configured
to accept 128 kB articles.

Even at 1 MB, this is only a few seconds' worth of audio / video as --
purportedly -- ***@Neodome pointed out in a different message. These
messages really don't amount to much. -- My news server sees 50
or more of these messages' worth of traffic per day. So, one of these
per month, much less per quarter, is not even worth complaining about.
--
Grant. . . .
unix || die
Ted Heise
2023-03-16 18:51:12 UTC
Permalink
On Thu, 16 Mar 2023 12:30:16 -0600,
Post by Grant Taylor
Post by Neodome Admin
I doubt that anything in those FAQs is more useful than
information that can be found with Google or DuckDuckGo. We're
not living in an era of Altavista, after all. And if there is
some kind of gem hidden there, one simply don't need to post
it to newsgroup regularly with 700+KB of irrelevant text.
I think that there is some value in having some unrequested
information put in front of you.
I've seen many things that I didn't know that I wanted to know
put in front of me.
I've also been mildly interested in something and seen
something new (to me) done with it that really piques my
interest and causes me to actively investigate it.
I believe there is some value in things being put in front of
me for my perusal.
Post by Neodome Admin
Plus, I'm pretty sure that if there are any questions, one can
just ask a question in retro-computing group and expect an
answer... unless that group is dead, of course.
It's really hard to ask a question about something if you don't
know that said something exists.
I don't mind quarterly or even monthly posting of FAQs. I do
have an objection to super large FAQs. -- I think I have my
server configured to accept 128 kB articles.
Lots of good points, Grant. As usual. :)

FWIW, I maintain the FAQ (for alt.recovery.aa). It's just under
40k in size and posts from a cron job on the first day of each month.

Also, on a weekly basis (except the first week of each month)
another cron job posts a very short FAQ pointer, directing folks
to where it can be found on the web. The pointer idea was gleaned
from another group (alt.os.linux.slackware).
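For anyone wanting to replicate that setup, a rough crontab sketch (the
script names here are hypothetical; the "skip the first week" check would
live inside the pointer script):

   # post the full FAQ on the 1st of each month, 06:00
   0 6 1 * *   /home/news/bin/post-faq-full
   # post the short pointer every Monday
   0 6 * * 1   /home/news/bin/post-faq-pointer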
--
Ted Heise <***@panix.com> West Lafayette, IN, USA
Neodome Admin
2023-03-22 07:18:29 UTC
Permalink
Post by Grant Taylor
Post by Neodome Admin
I don't doubt that.
So you agree that the content of the articles does have some value to
some people.
Same as binary MIME attachments to legit Usenet messages written by real
people. They have some value for me if they add to the conversation. Is
there really a reason to avoid them now when I literally use more memory
on my 256 GB iPhone to store pictures of random dogs and cats than I use
on my server to store 2 years of unfiltered text Usenet? By unfiltered I
mean completely unfiltered, all Google Groups spam and other junk
included.

I just find it technically much simpler to differentiate by the article
size. Bigger than some value - binary. Smaller - text Usenet.

Thus my advice.
Post by Grant Taylor
Post by Neodome Admin
I doubt that regular posting of 700+KB FAQ is doing any good.
What's your primary objection? The frequency or the size of the posts?
Post by Neodome Admin
I doubt that anything in those FAQs is more useful than information
that can be found with Google or DuckDuckGo. We're not living in an
era of Altavista, after all. And if there is some kind of gem hidden
there, one simply don't need to post it to newsgroup regularly with
700+KB of irrelevant text.
I think that there is some value in having some unrequested
information put in front of you.
I've seen many things that I didn't know that I wanted to know put in
front of me.
I've also been mildly interested in something and seen something new
(to me) done with it that really piques my interest and causes me to
actively investigate it.
I believe there is some value in things being put in front of me for
my perusal.
FAQs are a little bit of a different story than other messages. Like I said,
my main problem with them is that they're not written by people, and
thus I don't see the need to treat them any differently than spam and
binaries. After all, all those binary messages can also be useful for
someone; maybe an even bigger number of people will find them more useful
than FAQs.

I think that legit text conversations in binary newsgroups bring more to
Usenet as a communication platform than bi-weekly FAQs in dead text
newsgroups; thus they are the ones that deserve to be preserved for
future readers. BTW, currently that's not being done by text Usenet
servers.
Post by Grant Taylor
Post by Neodome Admin
Plus, I'm pretty sure that if there are any questions, one can just
ask a question in retro-computing group and expect an
answer... unless that group is dead, of course.
It's really hard to ask a question about something if you don't know
that said something exists.
I don't mind quarterly or even monthly posting of FAQs. I do have an
objection to super large FAQs. -- I think I have my server
configured to accept 128 kB articles.
Even at 1 MB, this is only a few seconds worth of audio / video as --
These messages really are not much to sneeze at. -- My news server
sees 50 or more of these messages worth of traffic per day. So, one
of these per month, much less quarter, not even worth complaining
about.
You are correct. If there are FAQs bigger than 64 KB, the amount of data
they consume is minuscule compared even to the Google Groups
spam. Actually, thinking of it, I might receive them anyway from one of
the peers who set their newsfeed incorrectly and probably still hasn't
fixed it. I just never complained about it because it's not a problem from
a technical point of view.
Grant Taylor
2023-03-22 18:09:28 UTC
Permalink
Post by Neodome Admin
Same as binary MIME attachments to legit Usenet messages written
by real people. They have some value for me if they add to the
conversation.
Fair.
Post by Neodome Admin
Is there really a reason to avoid them now when I literally use more
memory on my 256 GB iPhone to store pictures of random dogs and cats
than I use on my server to store 2 years of unfiltered text Usenet? By
unfiltered I mean completely unfiltered, all Google Groups spam and
other junk included.
ACK
Post by Neodome Admin
I just find it technically much simpler to differentiate by the
article size. Bigger than some value - binary. Smaller - text Usenet.
I agree that the size is a likely indicator of binary or not.

Though, I wonder if we are now in the day & age where we could create
filters that either:

- detect multiple strings of text with white space between them, thus
words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for
base64
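As a rough illustration of the second idea, here is a minimal Perl sketch
(not Cleanfeed's actual code; the threshold and line pattern are
assumptions) that flags a body as likely binary when most of its lines
look like fixed-width base64:

   use strict;
   use warnings;

   # Returns true if the majority of body lines are 60-76 characters of
   # the base64 alphabet, i.e. the body is probably an encoded attachment.
   sub looks_like_base64_body {
       my ($body) = @_;
       my @lines = grep { length } split /\n/, $body;
       return 0 unless @lines;
       my $b64 = grep { /^[A-Za-z0-9+\/]{60,76}={0,2}$/ } @lines;
       return ($b64 / @lines) > 0.8;
   }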
Post by Neodome Admin
Thus my advice.
Fair enough. Your server, your rules.
Post by Neodome Admin
FAQs are little bit different story than other messages. Like I said,
my main problem with them is that they're not written by the people,
and thus I don't see the need to treat them any different than spam
and binaries. After all, all those binary messages also can be useful
for someone, maybe even bigger amount of people will find them more
useful compared to FAQs.
I understand what you're saying.

But is there anything to differentiate the FAQs posted by automation and
an FAQ copied from a template and pasted into the news reader by a
human? ;-)

The original text was almost certainly written by a human. Even if the
current form it is in is an amalgamation of copy & paste et al.
Post by Neodome Admin
I think that legit text conversations in binary newsgroups bring more
to the Usenet as communication platform than bi-weekly FAQs in dead
text newsgroups, thus they are the ones that deserve to be preserved
for the future readers.
I can agree with that.
Post by Neodome Admin
BTW, currently it's not being done by text Usenet servers.
Agreed.

I suspect that's based on older methods of identifying / handling binary
attachments.
Post by Neodome Admin
You are correct. If there are FAQs bigger than 64 Kb, the amount of
data they consume is miniscule compared even to the Google Groups
spam. Actually, thinking of it, I might receive them anyway from one
of the peers who set their newsfeeds incorrectly, and probably still
didn't fix it. I just never complained about it because it's not a
problem from technical point of view.
We all have things that we could improve on. I choose to focus on the
things with bigger impact.
--
Grant. . . .
unix || die
Jesse Rehmer
2023-03-22 18:40:34 UTC
Permalink
Post by Grant Taylor
Though, I wonder if we are now in the day & age that we could create
- detect multiple strings of text with white space between them, thus
words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for
base64
Diablo has this article type detection built in and allows you to filter based
on types in newsfeed definitions. Cleanfeed and pyClean do the same for INN.
It's not perfect, but pretty damn effective.
Tom Furie
2023-03-22 19:33:13 UTC
Permalink
Post by Jesse Rehmer
Post by Grant Taylor
Though, I wonder if we are now in the day & age that we could create
- detect multiple strings of text with white space between them, thus
words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for
base64
Diablo has this article type detection built in and allows you to filter based
on types in newsfeed definitions. Cleanfeed and pyClean do the same for INN.
It's not perfect, but pretty damn effective.
While Cleanfeed is effective enough at what it does, there's no "smarts"
to it and it can be a chore coming up with effective patterns that work
but don't get in the way of legitimate posts that happen to contain some
of the "trouble" words or phrases. I've been wondering if it might be
possible to use something like SpamAssassin with Bayesian learning on a
newsfeed, though I haven't got to the point of trying to implement
anything yet.

Cheers,
Tom
Grant Taylor
2023-03-22 19:36:19 UTC
Permalink
Post by Tom Furie
While Cleanfeed is effective enough at what it does, there's no
"smarts" to it and it can be a chore coming up with effective patterns
that work but don't get in the way of legitimate posts that happen to
contain some of the "trouble" words or phrases.
Please elaborate and share some examples.

In the context of detecting encoded binary attachments, I feel like that
should be relatively easy to do.
Post by Tom Furie
I've been wondering if it might be possible to use something like
spamassassin with bayesian learning on a newsfeed though I haven't
got to the point of trying to implement anything yet.
I don't know what SpamAssassin will think of news articles.

I wonder if it would be possible to leverage something like the milter
interface to SpamAssassin so that you don't need to integrate and or
fork SpamAssassin.
--
Grant. . . .
unix || die
Tom Furie
2023-03-22 19:54:14 UTC
Permalink
Post by Grant Taylor
In the context of detecting encoded binary attachments, I feel like that
should be relatively easy to do.
Oh, there's no problem with it catching binaries, that's a non-issue. I'm
talking about methods for catching the still ever-prevalent text spam.
Post by Grant Taylor
Post by Tom Furie
I've been wondering if it might be possible to use something like
spamassassin with bayesian learning on a newsfeed though I haven't
got to the point of trying to implement anything yet.
I don't know what SpamAssassin will think of news articles.
I don't imagine it will have any problem with the bodies, but the
headers will likely be a different matter since I doubt SpamAssassin
knows anything about them. Maybe some custom rulesets to inform it what
to look at...
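For instance, something like this in local.cf might be a starting point
(the rule name and score are made up for illustration; SpamAssassin's
generic header syntax does let you match arbitrary headers such as
Newsgroups):

   # Flag articles crossposted to an unusually large number of groups
   header   LOCAL_NG_CROSSPOST   Newsgroups =~ /(?:[^,]+,){5,}/
   describe LOCAL_NG_CROSSPOST   Crossposted to more than five newsgroups
   score    LOCAL_NG_CROSSPOST   1.5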
Post by Grant Taylor
I wonder if it would be possible to leverage something like the milter
interface to SpamAssassin so that you don't need to integrate and or
fork SpamAssassin.
Yes, I was thinking of interfacing that way, or feeding everything off to
spamd.

Cheers,
Tom
Grant Taylor
2023-03-22 23:53:03 UTC
Permalink
Post by Tom Furie
Oh, there's no problem with it catching binaries, that's a
non-issue. I'm talking about methods for catching the still
ever-prevalent text spam.
Oh. Okay. That makes more sense.
Post by Tom Furie
I don't imagine it will have any problem with the bodies, but the
headers will likely be a different matter since I doubt spamassassin
knows anything about them. Maybe some custom rulesets to inform it
what to look at...
I wouldn't be surprised if SpamAssassin did know what to do with a news
post.

I also would be surprised if it couldn't be taught how to deal with news
posts.
Post by Tom Furie
Yes, I was thinking of interfacing that way, or feeding everything
off to spamd.
:-)
--
Grant. . . .
unix || die
Tom Furie
2023-03-22 20:43:09 UTC
Permalink
Post by Grant Taylor
Post by Tom Furie
While Cleanfeed is effective enough at what it does, there's no
"smarts" to it and it can be a chore coming up with effective patterns
that work but don't get in the way of legitimate posts that happen to
contain some of the "trouble" words or phrases.
Please elaborate and share some examples.
Here are a few that I think illustrate the "effective pattern" problem.
Now, this sample is all Google - which is already tagged as a known spam
source - but still they made it through. Sure, I could just block the
sender, but that seems a bit of "blunt instrument" approach to me. And
what happens in the potential situation where a spammer forges an
otherwise legitimate poster's email address, etc?

There are also the posts whereby the originals get caught by the filter,
but the fully quoted replies, including full headers posted into the body
of the "complaint", make it through. That's one poster I'm incredibly
close to outright banning, since he's effectively just a reflector of
the original spam.

<c8eac7b9-bcf8-4c74-98a8-***@googlegroups.com>
<1273d23c-0256-4317-97d4-***@googlegroups.com>
<bda61489-0096-4ff2-9a66-***@googlegroups.com>

Cheers,
Tom
Grant Taylor
2023-03-22 23:57:43 UTC
Permalink
Post by Tom Furie
Here are a few that I think illustrate the "effective pattern" problem.
Thank you for the message IDs. Unfortunately Thunderbird is treating
them as email addresses. I'll have to find a way to look them up.
Post by Tom Furie
Now, this sample is all Google - which is already tagged as a known
spam source - but still they made it through. Sure, I could just
block the sender, but that seems a bit of "blunt instrument" approach
to me. And what happens in the potential situation where a spammer
forges an otherwise legitimate poster's email address, etc?
Ya. I'm not a fan of blocking Google carte blanche like some advocate for.
Post by Tom Furie
There's also the posts whereby the originals get caught by the filter,
but the fully quoted replies including full headers posted into
the body of the "complaint", make it through. That's one poster I'm
incredibly close to outright banning since he's effectively simply
a reflector of the original spam.
Oh ya. I hear you on that one. I'm a single digit number of such
examples away from banning a user like that too. I sort of suspect we
are talking about the same user. Possibly one with a professional
sounding name?
--
Grant. . . .
unix || die
Gew Ghul Suques
2023-03-23 03:09:12 UTC
Permalink
Ya.  I'm not a fan of blocking Google carte blanche like some advocate for.
Phuque gewghul.
--
Gew Ghul Suques
Jesse Rehmer
2023-03-22 19:53:35 UTC
Permalink
Post by Tom Furie
Post by Jesse Rehmer
Post by Grant Taylor
Though, I wonder if we are now in the day & age that we could create
- detect multiple strings of text with white space between them, thus
words.
- detect the standard encoding methods; e.g. 76 x [A-Za-z0-9+/=] for
base64
Diablo has this article type detection built in and allows you to filter based
on types in newsfeed definitions. Cleanfeed and pyClean do the same for INN.
It's not perfect, but pretty damn effective.
While Cleanfeed is effective enough at what it does, there's no "smarts"
to it and it can be a chore coming up with effective patterns that work
but don't get in the way of legitimate posts that happen to contain some
of the "trouble" words or phrases. I've been wondering if it might be
possible to use something like spamassassin with bayesian learning on a
newsfeed though I haven't got to the point of trying to implement
anything yet.
Cheers,
Tom
I agree, and I decided that Diablo's duplicate article detection is good
enough for me with regard to spam filtering. Interestingly, all the default
documentation and examples only talk about using this duplicate detection for
binary articles, but I changed the feed definition to include everything and
it seems about as effective as Cleanfeed/pyClean. Its binary detection seems
really good, and I'm not chewing up any noticeable CPU with filtering now.

I do use pyClean with some bad_from and bad_subject filters on my spool server
for finer granularity there.

Seems like I remember efforts in the past, perhaps not specific to INN or
Diablo, but other tools to implement SpamAssassin for filtering articles, but
offhand I can't recall where that conversation occurred.
Julien ÉLIE
2023-03-22 21:00:17 UTC
Permalink
Hi all,
Post by Jesse Rehmer
Post by Grant Taylor
I don't know what SpamAssassin will think of news articles.
I wonder if it would be possible to leverage something like the
milter interface to SpamAssassin so that you don't need to
integrate and/or fork SpamAssassin.
Seems like I remember efforts in the past, perhaps not specific to INN or
Diablo, but other tools to implement SpamAssassin for filtering articles, but
off hand can't recall where that conversation occurred.
From: yamo' <***@beurdin.invalid>
Newsgroups: news.software.nntp
Subject: Re: Google Groups spam - INN/Cleanfeed/etc solutions?
Date: Sun, 19 Sep 2021 10:11:24 -0000 (UTC)
Message-ID: <si72cc$ko9$***@pi2.pasdenom.info>

:-)
--
Julien ÉLIE

« Ta remise sur pied lui a fait perdre la tête ! » (Astérix)
Tom Furie
2023-03-22 21:27:01 UTC
Permalink
Post by Julien ÉLIE
Newsgroups: news.software.nntp
Subject: Re: Google Groups spam - INN/Cleanfeed/etc solutions?
Date: Sun, 19 Sep 2021 10:11:24 -0000 (UTC)
Ooh, nice! That's going to be well worth a look into.

Cheers,
Tom
Richard Kettlewell
2023-03-16 13:07:12 UTC
Permalink
Post by Jesse Rehmer
Post by Neodome Admin
I said "meaningful", Tom :-)
Seriously, it's not 1995, 2001, or even 2008. No one reads those
FAQs. At least on Usenet. We might pretend all we want, but that's just
the way things are. Those FAQs are nothing more than regular spam in
most of the newsgroups where they are posted. How many times have you
visited a group and found nothing except Google Groups drug scams and
those FAQs? Probably a lot of times, huh?
I've found many useful over the years. If you're not being fed them it
seems difficult to judge their value.
For people getting into retro computing (Atari, Amiga, etc.), some of
those 700+KB FAQ articles are gold.
How many of them contain any information that can’t be found on the web?
--
https://www.greenend.org.uk/rjk/
Jesse Rehmer
2023-03-16 13:57:52 UTC
Permalink
On Mar 16, 2023 at 8:07:12 AM CDT, "Richard Kettlewell"
Post by Richard Kettlewell
Post by Jesse Rehmer
Post by Neodome Admin
I said "meaningful", Tom :-)
Seriously, it's not 1995, 2001, or even 2008. No one reads those
FAQs. At least on Usenet. We might pretend all we want, but that's just
the way things are. Those FAQs are nothing more than regular spam in
most of the newsgroups where they are posted. How many times have you
visited a group and found nothing except Google Groups drug scams and
those FAQs? Probably a lot of times, huh?
I've found many useful over the years. If you're not being fed them it
seems difficult to judge their value.
For people getting into retro computing (Atari, Amiga, etc.), some of
those 700+KB FAQ articles are gold.
How many of them contain any information that can’t be found on the web?
Same can be said for all of Usenet if you take that stance.
Marco Moock
2023-03-11 09:54:19 UTC
Permalink
Post by Gary Goog
Post by Eli
I will soon be setting up a text-only news server for public
access, but what is the minimum storage capacity my server needs
for all non-binary newsgroups?
Anything you can afford. Hard disk prices keep getting cheaper. Are
you going to require registration, or will it be an open server like
Paganini, aioe, and mixmin? Open servers are quite popular and you'll
get more users that way.
Make sure you don't filter or censor anything or block anybody on it
otherwise you will become a hate figure and target for hackers.
Such servers will likely be abused by trolls, name forgers and spammers.
This causes many people to filter out any post coming from such a
server.

In the German de.* hierarchy, many people filtered out mixmin and aioe
because they were abused by trolls.

I recommend setting up a registration and terminating accounts that
abuse the server with such behaviour.
U.ee
2023-03-10 22:14:07 UTC
Permalink
Hello!
Post by Eli
Hi,
This question has probably been asked before, but I couldn't find it, so here
it is.
I will soon be setting up a text-only news server for public access, but what
is the minimum storage capacity my server needs for all non-binary newsgroups?
Thanks, Eli.
Depends on how much spam and how many low-value articles you can filter out.
20-30 GB per year is comfortable.
You can do with way less if you have a curated list of groups and a good
spam filter.

Best regards,
U.ee
Eli
2023-03-10 22:21:38 UTC
Permalink
Post by U.ee
Hello!
Depends on how much spam and how many low-value articles you can filter out.
20-30 GB per year is comfortable.
You can do with way less if you have a curated list of groups and a good
spam filter.
Best regards,
U.ee
If that is for all non-binary newsgroups, then that's not bad. I expected
considerably more.

Thanks for the info.
Jesse Rehmer
2023-03-11 01:15:31 UTC
Permalink
Post by Eli
Hi,
This question has probably been asked before, but I couldn't find it, so here
it is.
I will soon be setting up a text-only news server for public access, but what
is the minimum storage capacity my server needs for all non-binary newsgroups?
Thanks, Eli.
Storage-capacity-wise, I've got 20 years of the Big8 consuming ~750 GB. On a
server with ZFS using CNFS buffers with INN, this can compress down to about
300 GB using default ZFS compression.
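For anyone wanting to reproduce that, a minimal sketch (the dataset name
is made up; lz4 is a common choice, and what "default" compression means
can differ by platform):

   # create a compressed dataset for the spool and check the ratio later
   zfs create -o compression=lz4 tank/news
   zfs get compressratio tank/news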

Cheers,
Jesse
Marco Moock
2023-03-11 09:52:50 UTC
Permalink
Post by Eli
I will soon be setting up a text-only news server for public access,
but what is the minimum storage capacity my server needs for all
non-binary newsgroups?
It mostly depends on how long you will keep old articles.

Look at some statistics:
https://www.eternal-september.org/stats/index.html

Per day you should plan for at least 50 MB.
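(For rough planning, 50 MB/day works out to about 1.5 GB/month, or on the
order of 18 GB/year, which is in line with the 20-30 GB/year figures
mentioned elsewhere in this thread.)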
Timo
2023-03-12 01:11:24 UTC
Permalink
Post by Eli
Hi,
This question has probably been asked before, but I couldn't find it, so here
it is.
I will soon be setting up a text-only news server for public access, but what
is the minimum storage capacity my server needs for all non-binary newsgroups?
Thanks, Eli.
Hi Eli,

The storage requirement for a news server without binaries is rather small.

If you set up a server based on Debian or Ubuntu, plan around 15-20 GB,
because the log files will quickly fill up your disk if there are errors.

For the pure data of the newsgroups, it depends on how long you want to
keep the articles and how many newsgroups you want to provide.

I keep a relatively large portfolio of groups on the server and get
about 40 - 45 GB per year (with spam filter).

If you have the option, get a good 500 GB HDD; then you have plenty
of space and don't have to worry about it for the time being.

I have had rather negative experiences with SSDs when operating a news
server, as they age very quickly with the enormous number of write cycles.

Greetings,
--
Timo
Grant Taylor
2023-03-12 07:20:01 UTC
Permalink
Post by Timo
If you set up a server based on Debian or Ubuntu, plan around 15-20 GB,
because the log files will quickly fill up your disk if there are errors.
I would *STRONGLY* suggest checking out logrotate or the like if
you're not using it.
--
Grant. . . .
unix || die
San Kirtan Dass
2023-03-12 09:24:34 UTC
Permalink
Post by Grant Taylor
Post by Timo
If you set up a server based on Debian or Ubuntu, plan around 15-20
GB, because the log files will quickly fill up your disk if there are
errors.
I would *STRONGLY* suggest checking out log-rotate or the likes if
you're not using it.
I would rather just set up inotify scripts to truncate or delete log
files to prevent them from filling up a lot of space.

Does INN2 require any of the data in the log files for operation?

Is it safe to delete the log files once they reach a certain size?

What about truncating the log files to X lines every Y hours or when
inotify reports a size limit?
--
San Kirtan Dass
Grant Taylor
2023-03-12 17:57:38 UTC
Permalink
Post by San Kirtan Dass
I would rather just set up inotify scripts to truncate or delete log
files to prevent them from filling up a lot of space.
Okay. I'm not sure why you would want to re-invent the wheel differently.
Post by San Kirtan Dass
Does INN2 require any of the data in the log files for operation?
I don't think that /log/ files are required for operation. There are
other files that grow that are used for operation; e.g. lists of
messages that are waiting to be fed to peers.
Post by San Kirtan Dass
Is it safe to delete the log files once they reach a certain size?
I /think/ so.

You could try renaming them so that they aren't at the path & file name
that INN is looking for and then HUP / restart INN and see if you have
problems. It would be easy to put them back if you needed to.
Post by San Kirtan Dass
What about truncating the log files to X lines every Y hours or when
inotify reports a size limit?
Simply truncating files without doing anything else is likely to cause
some corruption and / or uncontrolled disk consumption. You can reduce
the size of the file on disk, but anything with an open file handle may
not know that the file size has shrunk and may therefore do the wrong
thing the next time it writes to the file.

I'm curious why you want to go the inotify route as opposed to simply a
cron job that periodically checks the size of file(s) and takes proper
action if they are over a threshold (size and / or age).

This type of script is -- as I understand it -- exactly what logrotate
does and you can easily alter how frequently cron runs it.
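To make the comparison concrete, such a cron-driven check could be as
small as the following sketch (the path and the 100 MB threshold are
illustrative; "ctlinnd flushlogs" is the INN-friendly way to rotate, as
discussed further down the thread):

   #!/bin/sh
   # rotate INN's main log if it has grown past ~100 MB
   LOG=/var/log/news/news
   if [ -f "$LOG" ] && [ $(wc -c < "$LOG") -gt 104857600 ]; then
       ctlinnd flushlogs
   fi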

This is why I say that it feels like you're re-inventing the wheel.

If you want to re-invent the wheel, by all means go ahead and do so. I
just suggest you check out existing wheels, logrotate (et al.), /before/
you re-invent a new wheel to see what they do and / or don't do that you
want done.
--
Grant. . . .
unix || die
Julien ÉLIE
2023-03-14 11:41:49 UTC
Permalink
Hi Grant and San,
Post by Grant Taylor
Post by San Kirtan Dass
What about truncating the log files to X lines every Y hours or when
inotify reports a size limit?
Simply truncating files without doing anything else is likely to cause
some corruption and / or uncontrolled disk consumption.  You can reduce
the size of the file on disk, but anything with an open file handle may
not know that the file size has shrunk and may therefore do the wrong
thing the next time it writes to the file.
Indeed.
FWIW, INN comes with a program doing that (scanlogs):
https://www.eyrie.org/~eagle/software/inn/docs/scanlogs.html

"""
scanlogs invokes "ctlinnd flushlogs" to close the news and error log
files, rename them to add .old to the file names and open fresh news and
error logs; the active file is also flushed to disk, along with the
history database.

By default, scanlogs rotates and cleans out the logs. It keeps up to
logcycles old compressed log files in pathlog/OLD (the logcycles
parameter can be set in inn.conf). scanlogs also keeps archives of the
active file in this directory.
"""
Post by Grant Taylor
I'm curious why you want to go the inotify route as opposed to simply a
cron job that periodically checks the size of file(s) and takes proper
action if they are over a threshold (size and / or age).
Isn't it enough to run "news.daily" every day out of cron?
https://www.eyrie.org/~eagle/software/inn/docs/news.daily.html

It performs log rotation (with scanlogs) amongst other things.
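A typical way to arrange that is a nightly crontab entry for the news
user, roughly like this (the path and keywords vary by installation; see
the news.daily man page for the options appropriate to your setup):

   # run INN's nightly maintenance (expiry, log rotation via scanlogs, ...)
   30 3 * * *   <pathbin>/news.daily expireover lowmark delayrm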

I really doubt that INN log files will fill up a 1 TB disk in one day...
but in case one wishes to check for that, I would then suggest running
scanlogs when inotify or a dedicated cron job checking the available
disk space reports that something should be done.
--
Julien ÉLIE

« C'est la goutte qui fait déborder l'amphore ! » (Assurancetourix)
Eli
2023-03-12 18:24:31 UTC
Permalink
On 10 Mar 2023 at 22:04:24 CET, "Eli" <***@gmail.com> wrote:

Does INN automatically populate the database with all existing articles from a
NEW peer, or only new articles that come in? If not, is there a way to download
all existing articles from a (commercial) news server via INN?

I know this is a lot of data to download :)

Eli.
Grant Taylor
2023-03-12 22:30:42 UTC
Permalink
Post by Eli
Does INN automatically populate the database with all existing
articles from a NEW peer or only new articles that come in.
INN, et al., only receive articles provided to them. They don't pull
articles /themselves/. There are some tools to pull articles for this
purpose. Some of these tools do come with INN.

This is how all the different news servers I've messed with have behaved.
Post by Eli
If not, is there a way to download all existing articles from a
(commercial) news server via INN?
As said above, there are some tools that can be used to pull messages.
I believe that `suck` is one such tool.
Post by Eli
I know this is a lot of data to download :)
The download isn't the hard part. The hard part will be getting those
messages into your local INN instance. You'll need to (temporarily)
disable default protections which reject older articles.

There are probably other ways to get older articles into INN, e.g.
modifying the spool directly, but there be dragons.
--
Grant. . . .
unix || die
Julien ÉLIE
2023-03-14 11:42:11 UTC
Permalink
Hi San,
Post by Eli
If not, is there a way to download all existing articles from a
(commercial) news server via INN?
As said above, there are some tools that can be used to pull messages. I
believe that `suck` is one such tool.
Yes, suck (an external program) does the job.
There's also pullnews, shipped with INN:
https://www.eyrie.org/~eagle/software/inn/docs/pullnews.html
The download isn't the hard part.  The hard part will be getting those
messages into your local INN instance.  You'll need to (temporarily)
disable default protections which reject older articles.
These commands should be used before the beginning of the pulling. The
first one deactivates the rejection of old articles, and the other ones
deactivate spam & abuse filtering.

ctlinnd param c 0
ctlinnd perl n
ctlinnd python n

After pullnews or suck have completed, then re-activate these protections:

ctlinnd param c 10
ctlinnd perl y
ctlinnd python y
--
Julien ÉLIE

« Quousque tandem ? » (Cicéron)
Grant Taylor
2023-03-14 18:03:14 UTC
Permalink
These commands should be used before the beginning of the pulling.  The
first one deactivates the reject of old articles, and the other ones
deactivate spam & abuse filtering.
    ctlinnd param c 0
    ctlinnd perl n
    ctlinnd python n
    ctlinnd param c 10
    ctlinnd perl y
    ctlinnd python y
Thank you for this information Julien. I'm copying it to my INN /
Usenet tips & tricks collection.
--
Grant. . . .
unix || die
Kerr Avon
2023-03-16 01:40:17 UTC
Permalink
Post by Grant Taylor
Thank you for this information Julien. I'm copying it to my INN /
Usenet tips & tricks collection.
I think I'd pay some good money or at least a few chocolate fish to read
those notes Grant :)
--
Agency News | news.bbs.nz
Grant Taylor
2023-03-16 03:53:18 UTC
Permalink
Post by Kerr Avon
I think I'd pay some good money or at least a few chocolate fish to
read those notes Grant :)
Chuckle.

Most of the things in the folder are related to establishing new peers
or Usenet software like INN & cleanfeed, my peering card, and the likes.
--
Grant. . . .
unix || die
Eli
2023-03-15 08:29:26 UTC
Permalink
On 14 Mar 2023 at 12:42:11 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi San,
Post by Eli
If not, is there a way to download all existing articles from a
(commercial) news server via INN?
As said above, there are some tools that can be used to pull messages. I
believe that `suck` is one such tool.
Yes, suck (an external program) does the job.
https://www.eyrie.org/~eagle/software/inn/docs/pullnews.html
Hi Jullien,

Pullnews is exactly what I was looking for and it works like a charm.
Thank you very much for this.

Can multiple pullnews instances be launched side by side?
Or does this corrupt the INN databases?

Just a quick question about the settings in expire.ctl.
I never want the old messages from any newsgroup to be automatically deleted
(expired), even if they are 20 years old.
I have 'groupbaseexpiry' set to 'false' (or is 'true' better?).
Is '0:1:99990:never' in expire.ctl the correct setting for this?
Julien ÉLIE
2023-03-16 08:21:11 UTC
Permalink
Hi Eli,
Post by Eli
Can multiple pullnews instances be launched side by side?
Yes, though you have to use a different set of newsgroups for each
instance. Otherwise, they would do the same thing and it won't run much
faster.

For instance:

pullnews -t 3 -c pullnews.marks1
pullnews -t 3 -c pullnews.marks2
...


with several groups in pullnews.marks1 and other groups in
pullnews.marks2. And run 2 instances side by side.
Post by Eli
Or does this corrupt the INN databases?
No, it won't corrupt anything.
Post by Eli
Just a quick question about the settings in expire.ctl.
I never want the old messages from any newsgroup to be automatically deleted
(expired). Even though they are 20 years old.
I have the 'groupbaseexpiry' on 'false' (or is 'true' better?).
Is '0:1:99990:never' in expire.ctl the correct setting for this?
With groupbaseexpiry set to false, I would use:

*:never:never:never

If you use CNFS buffers, make sure you have enough allocated space for
them (otherwise they will wrap, and articles will self-expire).
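For reference, CNFS buffer sizes are set in cycbuff.conf, roughly along
these lines (the paths, names and 10 GB size are illustrative only; see
INN's cycbuff.conf and storage.conf documentation for the details):

   # cycbuff.conf -- one 10 GB buffer; sizes are given in kilobytes
   cycbuff:TEXT1:/var/spool/news/cycbuffs/text1:10485760
   metacycbuff:TSPOOL:TEXT1

   # storage.conf -- store everything in that metacycbuff
   method cnfs {
       newsgroups: *
       class: 1
       options: TSPOOL
   }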
--
Julien ÉLIE

« Campagne électorale : c'est l'art de gagner les voix des pauvres avec
l'argent des riches en promettant à chacun des deux de les protéger
contre l'autre. » (Oscar Ameringer)
Eli
2023-03-17 14:22:15 UTC
Permalink
On 16 Mar 2023 at 09:21:11 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Can multiple pullnews instances be launched side by side?
Yes, though you have to use a different set of newsgroups for each
instance. Otherwise, they would do the same thing and it won't run much
faster.
pullnews -t 3 -c pullnews.marks1
pullnews -t 3 -c pullnews.marks2
In the pullnews logs I see many of these lines:
x
DEBUGGING 55508 421
x
DEBUGGING 55509 421

What does this mean and what causes it?
After each such line it takes about 2 minutes until the next article is
downloaded and this slows down the download enormously.

I use debugging level 4.

Thanks.
Julien ÉLIE
2023-03-17 19:21:45 UTC
Permalink
Hi Eli,
Post by Eli
Post by Julien ÉLIE
pullnews -t 3 -c pullnews.marks1
pullnews -t 3 -c pullnews.marks2
x
DEBUGGING 55508 421
x
DEBUGGING 55509 421
What does this mean and what causes it?
After each such line it takes about 2 minutes until the next article is
downloaded and this slows down the download enormously.
It means that article numbers 55508 and 55509 were not found on the
server (x). My guess is that the connection has timed out (421 special
internal code).

Jesse reported a bug which sounds like that a few months ago.

Could you please download the latest version of pullnews and try it?

https://raw.githubusercontent.com/InterNetNews/inn/main/frontends/pullnews.in

Just grab that file, rename it without .in, and change the first 2 lines
to fit what your current pullnews script has (it is the path to Perl and
the INN::Config module).
Then you can run that script. It will work with your version of INN.
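In concrete terms, something like this (the URL is the one given above;
adjust the first two lines to match your installation):

   wget https://raw.githubusercontent.com/InterNetNews/inn/main/frontends/pullnews.in
   mv pullnews.in pullnews
   chmod +x pullnews
   # then edit the first two lines to match your existing pullnews script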
--
Julien ÉLIE

« Campagne électorale : c'est l'art de gagner les voix des pauvres avec
l'argent des riches en promettant à chacun des deux de les protéger
contre l'autre. » (Oscar Ameringer)
Eli
2023-03-17 23:42:15 UTC
Permalink
On 17 Mar 2023 at 20:21:45 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Post by Eli
x
DEBUGGING 55508 421
x
DEBUGGING 55509 421
What does this mean and what causes it?
After each such line it takes about 2 minutes until the next article is
downloaded and this slows down the download enormously.
It means that article numbers 55508 and 55509 were not found on the
server (x). My guess is that the connection has timed out (421 special
internal code).
Jesse reported a bug which sounds like that a few months ago.
Could you please download the latest version of pullnews, and try it please?
https://raw.githubusercontent.com/InterNetNews/inn/main/frontends/pullnews.in
Just grab that file, rename it without .in, and change the first 2 lines
to fit what your current pullnews script has (it is the path to Perl and
the INN::Config module).
Then you can run that script. It will work with your version of INN.
Hello Julien,

The latest version seems to have fixed it.

Thank you
Julien ÉLIE
2023-03-18 07:18:39 UTC
Permalink
Hi Eli,
Post by Eli
The latest version seems to have fixed it.
Glad to hear it. Thanks for the confirmation the fix works. (It has
not been released yet; the INN 2.7.1 release is scheduled in April.)
--
Julien ÉLIE

« Sol attigit talos. »
Eli
2023-03-17 14:57:02 UTC
Permalink
Is it possible in pullnews to pre-skip articles above a certain number of
bytes, instead of downloading the whole article first?

Maybe by making a small change in the perl script?

Eli
Julien ÉLIE
2023-03-17 19:49:23 UTC
Permalink
Hi Eli,
Post by Eli
Is it possible in pullnews to pre-skip articles above a certain number of
bytes, instead of downloading the whole article first?
Currently not.
Post by Eli
Maybe by making a small change in the perl script?
I would suggest these additional lines:


@@ -928,6 +928,13 @@
push @{$article}, "\n" if not $is_control_art;
}
}
+
+ my $overview = $fromServer->xover($i);
+ # Skip the article if its size is more than 100,000 bytes.
+ if ($$overview{$i}[5] and $$overview{$i}[5] > 100000) {
+ $skip_article = 1;
+ }
+
if (not $skip_article
and (not $header_only or $is_control_art or
$add_bytes_header))
{


Before downloading the article, we just retrieve its overview data
(containing the size of the article at index 5 of the returned array),
amongst a few other fields.
Of course, change 100,000 to fit your needs.

I've quickly tested it, and I believe it works.

I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
--
Julien ÉLIE

« Le caramel est un invité du palais qui menace la couronne. » (Tristan
Bernard)
Jesse Rehmer
2023-03-17 20:27:59 UTC
Permalink
On Mar 17, 2023 at 2:49:23 PM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
Hi Julien,

I would likely use it. :)

-Jesse
Julien ÉLIE
2023-03-17 21:19:35 UTC
Permalink
Hi Jesse,
Post by Jesse Rehmer
Post by Julien ÉLIE
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
I would likely use it. :)
:-)

OK, I'll see how to properly implement it.
The quick patch sends an "OVER n" for each article number (#n). I plan
on sending a global "OVER n-high" command to retrieve all the sizes at
once for a given newsgroup. It will save time because fewer
commands/answers will be sent.
--
Julien ÉLIE

« Sol attigit talos. »
Eli
2023-03-17 23:44:16 UTC
Permalink
On 17 Mar 2023 at 20:49:23 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Is it possible in pullnews to pre-skip articles above a certain number of
bytes, instead of downloading the whole article first?
Currently not.
Post by Eli
Maybe by making a small change in the perl script?
@@ -928,6 +928,13 @@
}
}
+
+ my $overview = $fromServer->xover($i);
+ # Skip the article if its size is more than 100,000 bytes.
+ if ($$overview{$i}[5] and $$overview{$i}[5] > 100000) {
+ $skip_article = 1;
+ }
+
if (not $skip_article
and (not $header_only or $is_control_art or
$add_bytes_header))
{
I've quickly tested it, and I believe it works.
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
Hello Julien,

This also works perfectly.

Thanks again :)
Julien ÉLIE
2023-03-19 11:11:09 UTC
Permalink
Hi Eli and Jesse,
Post by Eli
Post by Julien ÉLIE
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
This also works perfectly.
I've just committed a proper patch, which will be shipped with INN
2.7.1. (You can grab it at the same URL as provided before.)


-L size

Specify the largest wanted article size in bytes. The default is to
download all articles, whatever their size. When this option is
used, pullnews will first retrieve overview data (if available) of
each newsgroup to process so as to obtain article sizes, before
deciding which articles to actually download.



% ./pullnews -d 1 -L 1000
[...]

. DEBUGGING 656 -- not downloading article
<sr92nq$1ie38$***@news.trigofacile.com> which has 1230 bytes
x DEBUGGING 657 -- article unavailable 423 No such article number 657
. DEBUGGING 658 -- not downloading article
<ssjtm9$6u9s$***@news.trigofacile.com> which has 1042 bytes


And naturally, they are downloaded when a greater size is specified with the
-L flag.

Hope it suits your needs :-)
--
Julien ÉLIE

« Il avait juste assez de culture pour faire des citations fausses. »
(Byron)
Jesse Rehmer
2023-03-19 14:53:05 UTC
Permalink
On Mar 19, 2023 at 6:11:09 AM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli and Jesse,
Hope it suits your needs :-)
Thanks, Julien, I am giving it a shot, but I think I may have encountered a
new bug.

When running with -d 1, sometimes when I hit CTRL-C to stop the process it
wipes out the pullnews.marks file. It does not do this every time, but seems
like it is happening if I stop the process while it is retrieving overview
information.
Julien ÉLIE
2023-03-19 15:40:08 UTC
Permalink
Hi Jesse,
Post by Jesse Rehmer
When running with -d 1, sometimes when I hit CTRL-C to stop the process it
wipes out the pullnews.marks file. It does not do this every time, but seems
like it is happening if I stop the process while it is retrieving overview
information.
When you say "wipe out", does it mean you have an empty pullnews.marks
file? Or a pullnews.marks file with wrong article numbers?

Does it happen only with "-d 1"?
I'm unsure what could cause that, as I've not changed the way the
configuration file is handled :-/
It is saved when pullnews receives a SIGINT (Ctrl+C for instance), and
it writes the last article number processed.
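
(Not the pullnews source; just one common way to implement such behaviour:
the signal handler only sets a flag, the loop stops cleanly, and the marks
data is written once at the end.)

use strict;
use warnings;

my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };    # Ctrl+C

my $high = 0;
for my $artnum (1 .. 1_000) {
    last if $interrupted;
    # ... download and process article $artnum here ...
    $high = $artnum;                     # last article number processed
}
print "Saving config\n";                 # then write the "group date high" lines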


I've tried to reproduce it with "-d 1", but do not see anything
suspicious in pullnews.marks. The last line in standard output is
"Saving config" after Ctrl+C.
--
Julien ÉLIE

« When a newly married couple smiles, everyone knows why. When a ten-
year married couple smiles, everyone wonders why. »
Jesse Rehmer
2023-03-20 16:02:15 UTC
Permalink
On Mar 19, 2023 at 10:40:08 AM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Jesse,
Post by Jesse Rehmer
When running with -d 1, sometimes when I hit CTRL-C to stop the process it
wipes out the pullnews.marks file. It does not do this every time, but seems
like it is happening if I stop the process while it is retrieving overview
information.
When you say "wipe out", does it mean you have an empty pullnews.marks
file? Or a pullnews.marks file with wrong article numbers?
Does it happen only with "-d 1"?
I'm unsure what could cause that, as I've not changed the way the
configuration file is handled :-/
It is saved when pullnews receives a SIGINT (Ctrl+C for instance), and
it writes the last article number processed.
I've tried to reproduce it with "-d 1", but do not see anything
suspicious in pullnews.marks. The last line in standard output is
"Saving config" after Ctrl+C.
Here is what I'm seeing in a session that I did not kill, but was killed off
after the upstream host cut me off for time limit:

. DEBUGGING 361445 -- not downloading already existing message
<45af67cb$0$20803$***@news.tiscali.it> code=223
. DEBUGGING 361449 -- not downloading already existing message
<45af6a64$0$20803$***@news.tiscali.it> code=223
. DEBUGGING 361451 -- not downloading already existing message
<***@news.individual.net> code=223

Transfer to server failed (436): Flushing log and syslog files

When I start the command again:

[***@spool1 ~]$ pullnews -d 1 -O -c pullnews4.marks -L 200000 -t 3 -G
it.sport.calcio,it.sport.calcio.estero,it.sport.calcio.fiorentina,it.sport.ca
lcio.genoa,it.sport.calcio.inter
Mon Mar 20 11:00:14 2023 start

No servers!

[***@spool1 ~]$ cat pullnews4.marks
[***@spool1 ~]$
Julien ÉLIE
2023-03-20 19:07:23 UTC
Permalink
Hi Jesse,
Post by Jesse Rehmer
Here is what I'm seeing in a session that I did not kill, but was killed off
. DEBUGGING 361445 -- not downloading already existing message
. DEBUGGING 361449 -- not downloading already existing message
. DEBUGGING 361451 -- not downloading already existing message
Transfer to server failed (436): Flushing log and syslog files
Hmm, this log line does not correspond to a time limit enforced by the
upstream host. It is generated by the downstream server to which you
are sending articles. The "Flushing log and syslog files" message
appears during log rotation (INN is paused a very short moment).
Post by Jesse Rehmer
it.sport.calcio,it.sport.calcio.estero,it.sport.calcio.fiorentina,it.sport.ca
lcio.genoa,it.sport.calcio.inter
Mon Mar 20 11:00:14 2023 start
No servers!
Gosh!

Don't you have anything else after "Transfer to server failed (436):
Flushing log and syslog files"?
No "can't open pullnews4.marks" error?

I'm a bit surprised; the configuration file is saved this way:

open(FILE, ">$groupFile") || die "can't open $groupFile: $!\n";
print LOG "\nSaving config\n" unless $quiet;
print FILE "# Format: (date is epoch seconds)\n";
print FILE "# hostname[:port][_tlsmode] [username password]\n";
print FILE "# group date high\n";
foreach $server ( ... )
print [...]

close FILE;


You don't even have the "Saving config" debug line in your console, nor
the 3 initial # lines written in the new pullnews4.marks file...
Sounds like open() failed, or close() failed...

Could you try to add an explicit error message?

close(FILE) or die "can't close $groupFile: $!\n";



Can you think of anything that could explain why the file couldn't
be written? (lack of disk space, wrong permissions on the file because
pullnews was not started as the right user, etc.)
--
Julien ÉLIE

« Avez-vous remarqué qu'à table les mets que l'on vous sert vous mettent
les mots à la bouche ? » (Raymond Devos)
Jesse Rehmer
2023-03-20 19:37:09 UTC
Permalink
On Mar 20, 2023 at 2:07:23 PM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Jesse,
Post by Jesse Rehmer
Here is what I'm seeing in a session that I did not kill, but was killed off
. DEBUGGING 361445 -- not downloading already existing message
. DEBUGGING 361449 -- not downloading already existing message
. DEBUGGING 361451 -- not downloading already existing message
Transfer to server failed (436): Flushing log and syslog files
Hmm, this log line does not correspond to a time limit enforced by the
upstream host. It is generated by the downstream server to which you
are sending articles. The "Flushing log and syslog files" message
appears during log rotation (INN is paused a very short moment).
You are right, I have too many screen sessions running pullnews and mixed two
up.
Post by Julien ÉLIE
Post by Jesse Rehmer
it.sport.calcio,it.sport.calcio.estero,it.sport.calcio.fiorentina,it.sport.ca
lcio.genoa,it.sport.calcio.inter
Mon Mar 20 11:00:14 2023 start
No servers!
Gosh!
Flushing log and syslog files"?
No "can't open pullnews4.marks" error?
The output I provided was a copy/paste from the end of the session and the
commands I ran after.
Post by Julien ÉLIE
open(FILE, ">$groupFile") || die "can't open $groupFile: $!\n";
print LOG "\nSaving config\n" unless $quiet;
print FILE "# Format: (date is epoch seconds)\n";
print FILE "# hostname[:port][_tlsmode] [username password]\n";
print FILE "# group date high\n";
foreach $server ( ... )
print [...]
close FILE;
You don't even have the "Saving config" debug line in your console, nor
the 3 initial # lines written in the new pullnews4.marks file...
Sounds like open() failed, or close() failed...
Could you try to add an explicit message error ?
close(FILE) or die "can't close $groupFile: $!\n";;
Sure, may be a few days before I reply back with results!
Post by Julien ÉLIE
Don't you have in mind anything that could explain why the file couldn't
be written? (lack of disk space, wrong permissions on the file because
pullnews was not started with the right user, etc.)
Definitely no reason, plenty of space, no permission issues, I'm running
pullnews as the "news" user.
Eli
2023-03-20 10:51:48 UTC
Permalink
On 19 Mar 2023 at 12:11:09 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli and Jesse,
Post by Julien ÉLIE
I may add a dedicated option to pullnews and integrate it in a future
release, if that may prove to be useful for others.
I've just committed a proper patch, which will be shipped with INN
2.7.1. (You can grab it at the same URL as provided before.)
-L size
Specify the largest wanted article size in bytes. The default is to
download all articles, whatever their size. When this option is
used, pullnews will first retrieve overview data (if available) of
each newsgroup to process so as to obtain article sizes, before
deciding which articles to actually download.
Thank you Julien and I will test the new version soon.

However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup
with a few million messages. I don't know if I like that.

But maybe I misunderstood you.
Tom Furie
2023-03-20 11:12:11 UTC
Permalink
Post by Eli
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup
with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading,
whether that's all at once or per article. I imagine it's more efficient
to get it all up front then filter out the unwanted articles.

Cheers,
Tom
Eli
2023-03-20 13:06:41 UTC
Permalink
Post by Tom Furie
Post by Eli
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup
with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading,
whether that's all at once or per article. I imagine it's more efficient
to get it all up front then filter out the unwanted articles.
Cheers,
Tom
It may be a bit more efficient, but I still see more disadvantages than
advantages. For example, if I want to change the max. file size or want to
change the regex for option -m, this is no longer possible after the entire
overview has already been downloaded and the filtering has already been
processed.
Julien ÉLIE
2023-03-20 18:38:39 UTC
Permalink
Hi Eli,
Post by Eli
Post by Tom Furie
Post by Eli
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading,
whether that's all at once or per article. I imagine it's more efficient
to get it all up front then filter out the unwanted articles.
Indeed, downloading the overview in a single command is faster overall
than retrieving it article by article.

Suppose the group contains article numbers 1 to 1,000,000 and the last
time pullnews ran, it retrieved article 800,000.
Then on a new run, it will first ask for the overview of articles 800,001 to
1,000,000 in a single command, and it will get a single (long) answer.
Then pullnews will actually download only articles known to be smaller
than the maximum size wanted.

Otherwise, the easiest way would be to retrieve overview data article
by article. It will take more time overall, but I agree the user
experience is better as the user does not have the feeling of a "hang" during
the download of the whole overview.
I could tweak it to download in batches of 100 articles, for instance,
but it's more work to do :( I may end up doing so anyway...
Post by Eli
It may be a bit more efficient, but I still see more disadvantages than
advantages. For example, if I want to change the max. file size or want to
change the regex for option -m, this is no longer possible after the entire
overview has already been downloaded and the filtering has already been
processed.
No, it will still be possible.
Overview data is downloaded each time you run pullnews. It does not
save it for later re-use if you "reset" the newsgroup in pullnews.marks.
It downloads overview data between the last retrieved article and the
last existing article in the newsgroup.
--
Julien ÉLIE

« Give laugh to all but smile to one,
Give cheeks to all but lips to one,
Give love to all but Heart to one,
Let everybody love you
But you love one. »
Jesse Rehmer
2023-03-20 18:58:59 UTC
Permalink
On Mar 20, 2023 at 1:38:39 PM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Tom Furie
Post by Eli
However, I have my doubts about the fact that this new version first downloads
the overview data of the entire newsgroup. That can take a while for a
newsgroup with a few million messages. I don't know if I like that.
You need the overview to get the article size before downloading,
whether that's all at once or per article. I imagine it's more efficient
to get it all up front then filter out the unwanted articles.
Indeed, downloading the overview in a single command is faster overall
than retrieving it article by article.
Suppose the group contains article numbers 1 to 1,000,000 and the last
time pullnews ran, it retrieved article 800,000.
Then on a new run, it will first ask for the overview of articles 800,001 to
1,000,000 in a single command, and it will get a single (long) answer.
Then pullnews will actually download only articles known to be smaller
than the maximum size wanted.
Otherwise, the easiest way would be to retrieve overview data article
by article. It will take more time overall, but I agree the user
experience is better as the user does not have the feeling of a "hang" during
the download of the whole overview.
I could tweak it to download in batches of 100 articles, for instance,
but it's more work to do :( I may end up doing so anyway...
Maybe it would be sufficient to print a message indicating it is retrieving
overview data, so the user knows what is happening during the pause?
Julien ÉLIE
2023-03-20 21:11:55 UTC
Permalink
Hi Eli and Jesse,
Post by Jesse Rehmer
Post by Julien ÉLIE
I could tweak it to download in batches of 100 articles, for instance,
but it's more work to do :( I may end up doing so anyway...
Maybe it would be sufficient to print a message indicating it is retrieving
overview data, so the user knows what is happening during the pause?
It was easier to change than I thought.
Overview data is now retrieved in chunks of "progress width" articles at
a time. This corresponds to the value of -C (50 by default),
that is to say the number of "x", ".", and similar characters shown on one
progress line.
This way, downloading overview data does not take much time and is
almost unnoticeable by the user.
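
A rough standalone sketch of that batching (again with placeholder host,
group and chunk size, not the actual pullnews code):

use strict;
use warnings;
use Net::NNTP;

my $chunk = 50;                                 # one batch per progress line (-C)
my $nntp  = Net::NNTP->new('news.example.com') or die "cannot connect\n";
my ($count, $first, $last) = $nntp->group('misc.test') or die "cannot select group\n";

for (my $lo = $first; $lo <= $last; $lo += $chunk) {
    my $hi = $lo + $chunk - 1;
    $hi = $last if $hi > $last;
    my $overview = $nntp->xover([$lo, $hi]) || {};
    # ... use $overview->{$n}[5] (the size) to decide which of these
    #     articles to download, printing one progress character each ...
}
$nntp->quit;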

This improved version of pullnews is available at the same URL as
before. Thanks again for the feedback.
--
Julien ÉLIE

« Asseyez-vous une heure près d'une jolie fille, cela passe comme une
minute ; asseyez-vous une minute sur un poêle brûlant, cela passe
comme une heure : c'est cela la relativité. » (Einstein)
Eli
2023-03-21 13:49:16 UTC
Permalink
Hello Julien,

In some newsgroups I get the following error while using pullnews:

DEBUGGING 560 Post 436: Msg: <Can't store article>

Then pullnews quits.
Can this be avoided as it is very annoying.
Eli
2023-03-21 14:06:50 UTC
Permalink
Post by Eli
Hello Julien,
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
In the latest version of pullnews (the one from the link you posted earlier)
it quits with the error:

Transfer to server failed (436): Can't store article

It seems that this only happens with some old posted articles. But still very
annoying.

The new pullnews version is working great BTW. It is much faster than the
current one.
Julien ÉLIE
2023-03-21 19:18:32 UTC
Permalink
Hi Eli,
Post by Eli
Post by Eli
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
Do you happen to have other logs in <pathlog>/news.err or news.notice?
It would be useful to understand why innd did not manage to store the
article provided by pullnews. It is an unusual error. Do all the
newsgroups match an entry in storage.conf?
Post by Eli
In the latest version of pullnews (the one from the link you posted earlier)
Transfer to server failed (436): Can't store article
Didn't previous versions of pullnews report the same error?
Post by Eli
It seems that this only happens with some old posted articles. But still very
annoying.
Only old posts in some newsgroups? Do they have something special?
(article number > 2^31, unusual headers, etc.)
--
Julien ÉLIE

« Il n'y a que le premier pas qui coûte. » (Mme du Deffand)
Eli
2023-03-21 19:41:22 UTC
Permalink
On 21 Mar 2023 at 20:18:32 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Post by Eli
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
Do you happen to have other logs in <pathlog>/news.err or news.notice?
It would be useful to understand why innd did not manage to store the
article provided by pullnews.
I don't see any reports in the other logs.
Post by Julien ÉLIE
It is an unusual error. Do all the
newsgroups match an entry in storage.conf?
Yes, I think this is enough and all I have:

method tradspool {
    newsgroups: *
    class: 0
}
Post by Julien ÉLIE
Post by Eli
In the latest version of pullnews (the one from the link you posted earlier)
Transfer to server failed (436): Can't store article
Didn't previous versions of pullnews report the same error?
Yes it did.
Post by Julien ÉLIE
Post by Eli
It seems that this only happens with some old posted articles. But still very
annoying.
Only old posts in some newsgroups? Do they have something special?
(article number > 2^31, unusual headers, etc.)
Here are 3 that I have on hand at the moment:

<***@127.0.0.1>
<***@216.113.192.29>
<***@sp6iad.superfeed.net>

If you cannot access the articles then let me know and I'll post the headers
here.
Julien ÉLIE
2023-03-21 21:04:47 UTC
Permalink
Hi Eli,
Post by Eli
If you cannot access the articles then let me know and I'll post the headers
here.
I've tried to inject the first one on my news server, and do not see any
problem... I don't know why it cannot be stored on yours.
(I've only added "trigofacile.test" to the list of newsgroups as I do
not carry alt.*)

235 Article transferred OK


X-Proxy-User: $$ch0zr$fsnj
Newsgroups:
alt.2600.qna,alt.2600.warez,alt.2600.414,alt.2600a,alt.2600hz,alt.266,alt.2d,alt.2eggs.sausage.beans.tomatoes.2toast.largetea.cheerslove,alt,alt.3d,trigofacile.test
Subject: An Easier Way To Make Money
From: ***@usenet.com
Date: Wed, 2 Mar 2005 09:49:51 GMT
X-Newsreader: News Rover 10.2.0 (http://www.NewsRover.com)
Message-ID: <***@127.0.0.1e>
Lines: 140
X-Comments: This message was posted through <A href
X-Comments2: IMPORTANT: Newsfeeds.com does not condone,
X-Report: Please report illegal or inappropriate use to
Organization: Newsfeeds.com http://www.newsfeeds.com 100,000+ UNCENSORED
Newsgroups.
Path: news.trigofacile.com!news-out.spamkiller.net!spool9-east!not-for-mail
Xref: news.trigofacile.com trigofacile.test:713

...
--
Julien ÉLIE

« Plus un ordinateur possède de RAM, plus vite il peut générer un
message d'erreur. »
Eli
2023-03-21 21:42:29 UTC
Permalink
On 21 Mar 2023 at 22:04:47 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
I've tried to inject the first one on my news server, and do not see any
problem... I don't know why it cannot be stored on yours.
(I've only added "trigofacile.test" to the list of newsgroups as I do
not carry alt.*)
235 Article transferred OK
I'll keep looking for a cause.
Thank you very much for your time. I really appreciate it.
Julien ÉLIE
2023-03-22 06:47:21 UTC
Permalink
Hi Eli,
Post by Eli
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please
try:
scanspool -n -v

Though probably not related to overview, could you also try:
tdx-util -A
(if you're using tradindexed)
--
Julien ÉLIE

« Ce sont vos uniones, pas les miens ! » (Astérix)
Eli
2023-03-22 11:54:40 UTC
Permalink
On 22 Mar 2023 at 07:47:21 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please
scanspool -n -v
tdx-util -A
(if you're using tradindexed)
Since these commands take quite a long time, I will wait with this until all
pullnews sessions are done and let you know.

Something else:
How can I reset a newsgroup that has already been fully downloaded so that
pullnews starts downloading all posts again?

Can this be done by:
1) 'ctlinnd rmgroup newgroup'
2) 'ctlinnd newgroup group'

or is there a better way?

Thanks again and apologies for all my questions.
Jesse Rehmer
2023-03-22 15:23:25 UTC
Permalink
Post by Eli
On 22 Mar 2023 at 07:47:21 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please
scanspool -n -v
tdx-util -A
(if you're using tradindexed)
Since these commands take quite a long time, I will wait with this until all
pullnews sessions are done and let you know.
How can I reset a newsgroup that has already been fully downloaded so that
pullnews starts downloading all posts again?
1) 'ctlinnd rmgroup newgroup'
2) 'ctlinnd newgroup group'
or is there a better way?
Thanks again and apologies for all my questions.
If you've already made a full pass over the group with pullnews and want to
make another full pass, I think the easiest is to modify the pullnews.marks
counts for that group and set to 1. That should cause pullnews to start from
the beginning.
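
For illustration, an entry in the marks file looks roughly like this (the
host, group and numbers are made up; the comment header is the one pullnews
writes itself, and group lines are indented under their server line).
Setting the last field back to 1 makes the next run start over:

# Format: (date is epoch seconds)
# hostname[:port][_tlsmode] [username password]
# group date high
news.example.com
        it.sport.calcio 1679300000 1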
Eli
2023-03-22 18:54:08 UTC
Permalink
On 22 Mar 2023 at 16:23:25 CET, "Jesse Rehmer"
Post by Jesse Rehmer
Post by Eli
On 22 Mar 2023 at 07:47:21 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Transfer to server failed (436): Can't store article
I'll keep looking for a cause.
As it seems you are using the tradspool storage system, could you please
scanspool -n -v
tdx-util -A
(if you're using tradindexed)
Since these commands take quite a long time, I will wait with this until all
pullnews sessions are done and let you know.
How can I reset a newsgroup that has already been fully downloaded so that
pullnews starts downloading all posts again?
1) 'ctlinnd rmgroup newgroup'
2) 'ctlinnd newgroup group'
or is there a better way?
Thanks again and apologies for all my questions.
If you've already made a full pass over the group with pullnews and want to
make another full pass, I think the easiest is to modify the pullnews.marks
counts for that group and set to 1. That should cause pullnews to start from
the beginning.
Hi Jesse,

That's what I thought at first too, but this prevents all existing files in
the spool from being downloaded again and all messages are treated as
'-- not downloading already existing message'.

My question is therefore how you can completely reset a newsgroup so that
everything is downloaded again.

This in particular for the newsgroup 'news.lists.filters'. This group contains
the references to the 'spam' messages that NoCem then deletes. I want to reset
this newsgroup 'news.lists.filters' so that all messages are checked locally
again and in case of spam removed.

But besides this newsgroup I also want to reset other newsgroups.
I hope this is possible.
Julien ÉLIE
2023-03-22 21:16:46 UTC
Permalink
Hi Eli,
Post by Eli
Post by Jesse Rehmer
If you've already made a full pass over the group with pullnews and want to
make another full pass, I think the easiest is to modify the pullnews.marks
counts for that group and set to 1. That should cause pullnews to start from
the beginning.
That's what I thought at first too, but this prevents all existing files in
the spool from being downloaded again and all messages are treated as
'-- not downloading already existing message'.
My question is therefore how you can completely reset a newsgroup so that
everything is downloaded again.
Ah yes, that's a bit tricky as what you want is to remove all traces of
articles in spool, overview and history.

The proper method would be to:

- ctlinnd rmgroup xxx
- remove the <pathspool>/articles/.../xxx directory of the group
- set /remember/ to 0 in expire.ctl
- run the expireover and expire process (for instance via news.daily
called with the same parameters as in crontab, plus "notdaily")
- undo the change in expire.ctl (/remember/ set to 11)
- ctlinnd newgroup xxx
- reset the last downloaded article in pullnews.marks for this group
- deactivate Perl and Python filters, and set the artcutoff to 0
- run pullnews
- reactivate the filters, and artcutoff to 10


I think INN will happily accept being re-fed these articles.
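
As a rough shell rendition of those steps (the group name and spool path are
placeholders, and news.daily must be given the same keywords as in your
crontab):

ctlinnd rmgroup alt.example.group
rm -rf /usr/local/news/spool/articles/alt/example/group
# edit expire.ctl:  /remember/:0
news.daily notdaily              # plus your usual crontab keywords
# edit expire.ctl back:  /remember/:11
ctlinnd newgroup alt.example.group
# in pullnews.marks: set the high mark for alt.example.group back to 1
# in inn.conf: disable the Perl/Python filters and set artcutoff to 0
pullnews -G alt.example.group
# re-enable the filters and set artcutoff back to 10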
Post by Eli
This in particular for the newsgroup 'news.lists.filters'. This group contains
the references to the 'spam' messages that NoCem then deletes. I want to reset
this newsgroup 'news.lists.filters' so that all messages are checked locally
again and in case of spam removed.
As for NoCeM, you can directly refeed your notices to perl-nocem without
resetting anything.

perl-nocem expects storage tokens on its standard input.
Example:

echo '@020162BEB132016300000000000000000000@' | perl-nocem

As you're running tradindexed overview, I would suggest to have a look
at the output of:

tdx-util -g -n news.lists.filters

It dumps the overview data of this newsgroup. The last field is a
storage token.
You could replay NoCeM notices with this information :)
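
Putting the two together, something along these lines should replay every
notice in the group (assuming, as described above, that the storage token is
the last whitespace-separated field of each overview line):

tdx-util -g -n news.lists.filters | awk '{print $NF}' | perl-nocem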
--
Julien ÉLIE

« Ta remise sur pied lui a fait perdre la tête ! » (Astérix)
Jesse Rehmer
2023-03-22 21:23:48 UTC
Permalink
On Mar 22, 2023 at 4:16:46 PM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
Post by Eli
This in particular for the newsgroup 'news.lists.filters'. This group contains
the references to the 'spam' messages that NoCem then deletes. I want to reset
this newsgroup 'news.lists.filters' so that all messages are checked locally
again and in case of spam removed.
As for NoCeM, you can directly refeed your notices to perl-nocem without
resetting anything.
perl-nocem expects storage tokens on its standard input.
As you're running tradindexed overview, I would suggest to have a look
tdx-util -g -n news.lists.filters
It dumps the overview data of this newsgroup. The last field is a
storage token.
You could replay NoCeM notices with this information :)
Very nice, this is valuable to me as well. I know I will be doing this when I
am done gathering articles from all over the place. :-)
Eli
2023-03-22 21:34:02 UTC
Permalink
On 22 Mar 2023 at 22:16:46 CET, "Julien ÉLIE"
<***@nom-de-mon-site.com.invalid> wrote:

Hey Julien,
Post by Julien ÉLIE
Hi Eli,
Post by Eli
That's what I thought at first too, but this prevents all existing files in
the spool from being downloaded again and all messages are treated as
'-- not downloading already existing message'.
My question is therefore how you can completely reset a newsgroup so that
everything is downloaded again.
Ah yes, that's a bit tricky as what you want is to remove all traces of
articles in spool, overview and history.
- ctlinnd rmgroup xxx
- remove the <pathspool>/articles/.../xxx directory of the group
- set /remember/ to 0 in expire.ctl
- run the expireover and expire process (for instance via news.daily
called with the same parameters as in crontab, plus "notdaily")
- undo the change in expire.ctl (/remember/ set to 11)
- ctlinnd newgroup xxx
- reset the last downloaded article in pullnews.marks for this group
- deactivate Perl and Python filters, and set the artcutoff to 0
- run pullnews
- reactivate the filters, and artcutoff to 10
I think INN will happily accept being re-fed these articles.
Post by Eli
This in particular for the newsgroup 'news.lists.filters'. This group contains
the references to the 'spam' messages that NoCem then deletes. I want to reset
this newsgroup 'news.lists.filters' so that all messages are checked locally
again and in case of spam removed.
As for NoCeM, you can directly refeed your notices to perl-nocem without
resetting anything.
perl-nocem expects storage tokens on its standard input.
As you're running tradindexed overview, I would suggest to have a look
tdx-util -g -n news.lists.filters
It dumps the overview data of this newsgroup. The last field is a
storage token.
You could replay NoCeM notices with this information :)
Ah cool :)

Exactly what I needed.
Thanks so much.

Another question, is it possible to limit the maximum number of connections
per authenticated user? I know this is possible for peers, but can this also
be set up for authenticated users? Maybe a setting in readers.conf or nnrpd
that I'm overlooking?
Julien ÉLIE
2023-03-22 21:49:55 UTC
Permalink
Hi Eli,
Post by Eli
Another question, is it possible to limit the maximum number of connections
per authenticated user? I know this is possible for peers, but can this also
be set up for authenticated users? Maybe a setting in readers.conf or nnrpd
that I'm overlooking?
Unfortunately, the answer is no. There's no native way of limiting
users' connections.
You may want to write a custom authentication hook (perl_auth or
python_auth in readers.conf) that would do the job by tracking how
many connections a given user has open, and denying access if that exceeds
the limit. I am not aware of existing scripts to do that :-(

It could be worthwhile having though, as you're not the first one to ask
(but nobody wrote or shared what he came up with).
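
Very roughly, such a hook could look like the sketch below. It assumes the
perl_auth interface described in INN's hook-perl documentation (an
authenticate() subroutine, an %attributes hash carrying the user name and
password, and a return value of an NNTP status code plus a message); the
session directory, the connection limit and the password check are
placeholders, and stale entries are pruned by checking whether the recorded
nnrpd process still exists. Treat it as a starting point only:

use strict;
use warnings;

our %attributes;                    # filled in by nnrpd before each call

my $dir   = '/usr/local/news/tmp/nnrpd-sessions';   # placeholder path
my $limit = 4;                                       # placeholder limit

sub authenticate {
    my $user = $attributes{username} // '';
    return (481, 'Authentication failed')
        unless verify_password($user, $attributes{password});

    # Use a sanitised copy of the user name as part of the file names.
    (my $safe = $user) =~ s/[^\w.@-]/_/g;
    mkdir $dir unless -d $dir;

    # Count this user's recorded sessions that are still alive,
    # removing entries whose nnrpd process has exited.
    my $live = 0;
    for my $entry (glob("$dir/$safe.*")) {
        my ($pid) = $entry =~ /\.(\d+)$/;
        if ($pid and kill 0, $pid) { $live++ } else { unlink $entry }
    }
    return (481, 'Too many simultaneous connections') if $live >= $limit;

    # Record this nnrpd process (one process per reader connection).
    open my $fh, '>', "$dir/$safe.$$" or return (403, 'Internal error');
    close $fh;

    return (281, 'Authentication succeeded');
}

sub verify_password {
    # Placeholder: check the user name and password against your real
    # user database here.
    my ($user, $pass) = @_;
    return defined $user && defined $pass && length $pass;
}

1;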
--
Julien ÉLIE

« Ta remise sur pied lui a fait perdre la tête ! » (Astérix)
Jesse Rehmer
2023-03-22 22:25:03 UTC
Permalink
On Mar 22, 2023 at 4:49:55 PM CDT, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
Another question, is it possible to limit the maximum number of connections
per authenticated user? I know this is possible for peers, but can this also
be set up for authenticated users? Maybe a setting in readers.conf or nnrpd
that I'm overlooking?
Unfortunately, the answer is no. There's no native way of limiting
users' connections.
You may want to write a custom authentication hook (perl_auth or
python_auth in readers.conf) that would do the job by tracking how
many connections a given user has open, and denying access if that exceeds
the limit. I am not aware of existing scripts to do that :-(
It could be worthwhile having though, as you're not the first one to ask
(but nobody wrote or shared what he came up with).
It's pretty simple to run nnrpd via other utilities that will do the limiting
for you, though; most UNIX/Linux systems have at least two or three tools to
accomplish more or less the same thing.

That said, it would be nice to have that ability directly in nnrpd.
Julien ÉLIE
2023-03-22 22:35:44 UTC
Permalink
Hi Jesse,
Post by Jesse Rehmer
Post by Julien ÉLIE
Unfortunately, the response is no. There's no native way of limiting
users' connections.
It's pretty simple to run nnrpd via other utilities that will do the limiting
for you, though; most UNIX/Linux systems have at least two or three tools to
accomplish more or less the same thing.
That said, it would be nice to have that ability directly in nnrpd.
Ah yes, exactly. That's the reason why this was never implemented in
INN. It's not seen as a priority at all, and it's also not trivial to do.

Issue #23
"nnrpd currently has no way of limiting connections per IP address other
than using the custom auth hooks. In its daemon mode, it could in
theory keep track of this and support throttling. It's probably not
worth trying to support this when invoked via inetd, since at that point
one could just use xinetd and its built-in support for things like this.

When started from innd, this is a bit harder. innd has some basic rate
limiting stuff, but nothing for tracking number of simultaneous
connections over time. It may be fine to say that if you want to use
this feature, you need to have nnrpd be invoked separately, not run from
innd."


So the answer is to use something like "per_source = 5" in xinetd.conf
and have xinetd start nnrpd.
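
A minimal xinetd entry along those lines might look like this (the nnrpd
path and the numbers are illustrative):

service nntp
{
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = news
        server          = /usr/local/news/bin/nnrpd
        # overall cap on simultaneous nnrpd processes
        instances       = 50
        # at most 5 simultaneous connections per source IP
        per_source      = 5
}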
--
Julien ÉLIE

« Nous autres communistes, nous avons une position claire : nous n'avons
jamais changé, nous ne changerons jamais et nous sommes pour le
changement. » (Georges Marchais)
Grant Taylor
2023-03-23 00:01:52 UTC
Permalink
There's no native way of limiting users' connections.
It's not per user in that you could have multiple users per IP, but I'd
think seriously about doing this at the firewall such that each IP can
only have a limited number of connections.
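
For example, with iptables' connlimit match (port 119 and the limit of 5
simultaneous connections per source address are illustrative):

iptables -A INPUT -p tcp --syn --dport 119 \
         -m connlimit --connlimit-above 5 -j REJECT --reject-with tcp-reset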
--
Grant. . . .
unix || die
Julien ÉLIE
2023-03-22 18:38:07 UTC
Permalink
Hi Eli,
Post by Eli
How can I reset a newsgroup that has already been fully downloaded so that
pullnews starts downloading all posts again?
1) 'ctlinnd rmgroup newgroup'
2) 'ctlinnd newgroup group'
or is there a better way?
+1 for Jesse's way.

I have a question about these ctlinnd rmgroup/newgroup commands. Do you
happen to have already used them to "reset" a newsgroup?
It would explain the "Can't store" errors if you also did not purge the
tradspool files in <pathspool> for some newsgroups. Files named with
article numbers "1", "2", "3", etc. will still be present in your spool.
If you recreate a newsgroup with ctlinnd rmgroup/newgroup, it just
recreates it in the active file, without wiping the spool. Article
numbering is reset to 1, and INN will try to store articles in already
existing "1", "2", etc. files.
--
Julien ÉLIE

« Il ne faut jamais parler sèchement à un Numide. » (Astérix)
Eli
2023-03-22 18:56:56 UTC
Permalink
On 22 Mar 2023 at 19:38:07 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
How can I reset a newsgroup that has already been fully downloaded so that
pullnews starts downloading all posts again?
1) 'ctlinnd rmgroup newgroup'
2) 'ctlinnd newgroup group'
or is there a better way?
+1 for Jesse's way.
I have a question about these ctlinnd rmgroup/newgroup commands. Do you
happen to have already used them to "reset" a newsgroup?
It would explain the "Can't store" errors if you also did not purge the
tradspool files in <pathspool> for some newsgroups. Files named with
article numbers "1", "2", "3", etc. will still be present in your spool.
If you recreate a newsgroup with ctlinnd rmgroup/newgroup, it just
recreates it in the active file, without wiping the spool. Article
numbering is reset to 1, and INN will try to store articles in already
existing "1", "2", etc. files.
Hi Julien,

No, unfortunately I didn't.
I have not deleted or reset anything at all.
The problem still occurs intermittently and I don't understand why.
Eli
2023-03-22 22:08:35 UTC
Permalink
On 21 Mar 2023 at 20:18:32 CET, "Julien ÉLIE"
Post by Julien ÉLIE
Hi Eli,
Post by Eli
DEBUGGING 560 Post 436: Msg: <Can't store article>
Then pullnews quits.
Can this be avoided as it is very annoying.
Do you happen to have other logs in <pathlog>/news.err or news.notice?
It would be useful to understand why innd did not manage to store the
article provided by pullnews. It is an unusual error. Do all the
newsgroups match an entry in storage.conf?
Hi Julien,

I probably found the problem.
The errlog gives the following error:

==
innd: tradspool: could not symlink
/usr/local/news/spool/articles/alabama/politics/11365 to
/usr/local/news/spool/articles/alt/2600/414/78: Not a directory
==

/usr/local/news/spool/articles/alt/2600/414 is a file, but for some reason
INND wants to create a folder in that path with the same name as the file
name.

Any ideas how this is possible and how to fix?
Julien ÉLIE
2023-03-22 22:24:07 UTC
Permalink
Hi Eli,
Post by Eli
I probably found the problem.
==
innd: tradspool: could not symlink
/usr/local/news/spool/articles/alabama/politics/11365 to
/usr/local/news/spool/articles/alt/2600/414/78: Not a directory
==
/usr/local/news/spool/articles/alt/2600/414 is a file, but for some reason
INND wants to create a folder in that path with the same name as the file
name.
Any ideas how this is possible and how to fix?
Ah, OK, I understand.
The article I tested yesterday was posted to a newsgroup named
"alt.2600.414". It did not produce any error on my news server because
I do not have a newsgroup named alt.2600.

When you are using tradspool, there's a conflict between the 414th
article in the newsgroup alt.2600 (stored as the file alt/2600/414) and
any article in the newsgroup alt.2600.414 (which needs alt/2600/414 to be
a directory).
That is how the tradspool storage method works.

Such newsgroups should not be used. FYI, an excerpt of the naming
convention of newsgroups:

A <component> SHOULD NOT consist solely of digits and SHOULD NOT
contain uppercase letters. Such <component>s MAY be used only to
refer to existing groups that do not conform to this naming scheme,
but MUST NOT be used otherwise.

NOTE: All-digit <component>s conflict with one widely used storage
scheme for articles. Mixed-case groups cause confusion between
systems with case-sensitive matching and systems with case-
insensitive matching of <newsgroup-name>s.



How to fix it?
Well, you can't with your current storage method.
You would have to switch to another method (to set in storage.conf) like
CNFS, timecaf or timehash. All these three methods will be able to
store articles for such groups.
You could keep tradspool for all the newsgroups except for the
problematic ones if you want (you would have to explicitly list them in
storage.conf).
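
An illustrative storage.conf along those lines (the class numbers are
arbitrary but must be unique, and the group list is only an example; entries
are matched in order, so the specific one has to come before the tradspool
catch-all):

method timehash {
    newsgroups: alt.2600.414
    class: 1
}

method tradspool {
    newsgroups: *
    class: 0
}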
--
Julien ÉLIE

« Nous autres communistes, nous avons une position claire : nous n'avons
jamais changé, nous ne changerons jamais et nous sommes pour le
changement. » (Georges Marchais)